R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Course preparation: install libraries and plot exercises

library(dslabs)

installed.packages()

library(dslabs)
data(Teams)
## Warning message: In data(Teams) : data set 'Teams' not found
data(teams)
## Warning message: In data(teams) : data set 'teams' not found
data(murders)
population
## Error: object 'population' not found
murders$population

pop <- murders$population
length(pop)
## [1] 51
class(pop)
## [1] "numeric"
class(murders$state)
## [1] "character"

library(Lahman)
## Error in library(Lahman) : there is no package called 'Lahman'
install.packages("Lahman")

install.packages("tidyverse")

library(Lahman)
library(tidyverse)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
  lm(R_per_game ~ BB, .)

## Call:
## lm(formula = R_per_game ~ BB, data = .)
##
## Coefficients:
## (Intercept)           BB
##    2.582186     0.003396

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(R_per_game, AB_per_game)) +
  geom_point()

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_line()

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  ggplot(aes(AB, R)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

?Teams

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(number_of_wins_per_game = W/G, fielding_errors_per_game = E/G) %>%
  ggplot(aes(number_of_wins_per_game, fielding_errors_per_game)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(win_rate = W / G, E_per_game = E / G) %>%
  ggplot(aes(win_rate, E_per_game)) +
  geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(triple_per_game = X3B/G, double_per_game = X2B/G) %>%
  ggplot(aes(triple_per_game, double_per_game)) +
  geom_point(alpha = 0.5)

Course / Section 1: Introduction to Regression / 1.1: Baseball as a Motivating Example

Motivating Example: Moneyball

As motivation for this course, we’ll go back to 2002 and try to build a baseball team with a limited budget. Note that in 2002, the Yankees’ payroll was almost $130 million, more than triple the Oakland A’s $40 million budget. Statistics have been used in baseball since its beginnings. Note that the data set we will be using, included in the Lahman library, goes back to the 19th century. For example, a summary statistic we will describe soon, the batting average, has been used to summarize a batter’s success for decades. Other statistics such as home runs, runs batted in, and stolen bases (we’ll describe all of these soon) are reported for each player in the game summaries included in the sports section of newspapers.

And players are rewarded for high numbers. Although summary statistics were widely used in baseball, data analysis per se was not. These statistics were arbitrarily decided on without much thought as to whether they actually predicted, or were related to, helping a team win. This all changed with Bill James. In the late 1970s, this aspiring writer and baseball fan started publishing articles describing more in-depth analysis of baseball data. He named this approach of using data to determine which outcomes best predict whether a team wins sabermetrics. Until Billy Beane made sabermetrics the center of his baseball operations, Bill James’ work was mostly ignored by the baseball world.

Today, pretty much every team uses the approach, and it has gone beyond baseball into other sports. In this course, to simplify the example we use, we’ll focus on predicting scoring runs. We will ignore pitching and fielding, although those are important as well. We will see how regression analysis can help develop strategies to build a competitive baseball team with a constrained budget. The approach can be divided into two separate data analyses. In the first, we determine which recorded player-specific statistics predict runs. In the second, we examine if players were undervalued based on what our first analysis predicts.

Textbook link

The corresponding section of the textbook is the case study on Moneyball. https://rafalab.github.io/dsbook/linear-models.html#case-study-moneyball

Key point

Bill James was the originator of sabermetrics, the approach of using data to predict what outcomes best predicted if a team would win.

image here

image here

Baseball Basics

We actually don’t need to understand all the details about the game of baseball, which has over 100 rules, to see how regression will help us find undervalued players. Here, we distill the sport to the basic knowledge one needs to effectively attack the data science challenge. Let’s get started. The goal of a baseball game is to score more runs (they’re like points) than the other team. Each team has nine batters that bat in a predetermined order. After the ninth batter hits, we start with the first again. Each time they come to bat, we call it a plate appearance (PA). At each plate appearance, the other team’s pitcher throws the ball and you try to hit it. The plate appearance ends with a binary outcome: you either make an out (that’s a failure, and you sit back down) or you don’t (that’s a success, and you get to run around the bases and potentially score a run). (So a run means the batter makes it all the way around the bases and back to home???) Each team gets nine tries, referred to as innings, to score runs. Each inning ends after three outs, after you’ve failed three times.

From these examples, we see how luck is involved in the process. When you bat, you want to hit the ball hard. If you hit it hard enough, it’s a home run, the best possible outcome, as you get at least one automatic run. But sometimes, due to chance, you hit the ball very hard and a defender (who is this defender???) catches it, which makes it an out, a failure. In contrast, sometimes you hit the ball softly but it lands just in the right place. You get a hit, which is a success. The fact that there is chance involved hints at why probability models will be involved in all this. Now, there are several ways to succeed. Understanding this distinction will be important for our analysis.

When you hit the ball, you want to pass as many bases as possible. There are four bases, with the fourth one called home plate. Home plate is where you start, where you try to hit. So the bases form a cycle. If you get home, you score a run (does this mean you pass all 4 bases, one by one, during one at bat???). We’re simplifying a bit, but there are five ways you can succeed, in other words, not make an out. The first one is called a base on balls. This is when the pitcher does not pitch well and you get to go to first base. A single is when you hit the ball and you get to first base. A double is when you hit the ball and you go past first base to second. A triple is when you do that but get to third. And a home run is when you hit the ball and go all the way home and score a run. If you get to a base, you still have a chance of getting home and scoring a run if the next batter hits successfully. While you are on base, you can also try to steal a base. If you run fast enough, you can try to go from first to second or from second to third without the other team tagging you.

All right. Now, historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit and an at bat. Singles, doubles, triples, and home runs are hits. But remember, there’s a fifth way to be successful, the base on balls. That is not a hit. An at bat is the number of times you either get a hit or make an out; bases on balls are excluded. The batting average is simply hits divided by at bats, and it is considered the main measure of a success rate. In today’s game, this success rate ranges from player to player from about 20% to 38%. We refer to the batting average in thousands. So, for example, if your success rate is 25%, we say you’re batting 250.
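The batting average described above can be computed directly from the Lahman data; a minimal sketch (the 400-at-bat cutoff is an arbitrary choice for illustration, not from the lecture):

```r
library(Lahman)
library(dplyr)

# Batting average = H / AB; bases on balls count in neither H nor AB
Batting %>%
  filter(yearID == 2001, AB >= 400) %>%   # restrict to batters with many at bats
  mutate(BA = H / AB) %>%
  select(playerID, H, AB, BA) %>%
  arrange(desc(BA)) %>%
  head(5)
```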

One of Bill James’ first important insights is that the batting average ignores bases on balls, but a base on balls is a success. So a player that gets many more bases on balls than the average player might not be recognized if he does not excel in batting average. But is this player not helping produce runs? No award is given to the player with the most bases on balls. In contrast, the total number of stolen bases is considered important, and an award is given out to the player with the most. But players with high totals of stolen bases also make outs, as they do not always succeed.

So does a player with a high stolen base total help produce runs? Can we use data science to determine if it’s better to pay for bases on balls or stolen bases? One of the challenges in this analysis is that it is not obvious how to determine if a player produces runs, because so much depends on his teammates. We do keep track of the number of runs scored by a player. But note that if you hit after someone who hits many home runs, you will score many runs (the slugger hits the ball far away, so you can run many bases as well; lucky player). But these runs don’t necessarily happen if we hire this player but not his home-run-hitting teammate. However, we can examine team-level statistics (How ???). How do teams with many stolen bases compare to teams with few? How about bases on balls? We have data. Let’s examine some.

Textbook link

This video corresponds to the textbook section on baseball basics. https://rafalab.github.io/dsbook/linear-models.html#baseball-basics

Key points

The goal of a baseball game is to score more runs (points) than the other team.
Each team has 9 batters who have an opportunity to hit a ball with a bat in a predetermined order. 
Each time a batter has an opportunity to bat, we call it a plate appearance (PA).
The PA ends with a binary outcome: the batter either makes an out (failure) and returns to the bench or the batter doesn’t (success) and can run around the bases, and potentially score a run (reach all 4 bases).
We are simplifying a bit, but there are five ways a batter can succeed (not make an out):

    Base on balls (BB): the pitcher fails to throw the ball through a predefined area considered to be hittable (the strike zone), so the batter is permitted to go to first base.
    Single: the batter hits the ball and gets to first base.
    Double (2B): the batter hits the ball and gets to second base.
    Triple (3B): the batter hits the ball and gets to third base.
    Home Run (HR): the batter hits the ball and goes all the way home and scores a run.

Historically, the batting average has been considered the most important offensive statistic. To define this average, we define a hit (H) and an at bat (AB). Singles, doubles, triples, and home runs are hits. The fifth way to be successful, a walk (BB), is not a hit. An AB is the number of times you either get a hit or make an out; BBs are excluded. The batting average is simply H/AB and is considered the main measure of a success rate.
Note: The video states that if you hit AFTER someone who hits many home runs, you will score many runs, while the textbook states that if you hit BEFORE someone who hits many home runs, you will score many runs. The textbook wording is accurate.

image here

image here

plate appearance

image here

In baseball, a home run (abbreviated HR) is scored when the ball is hit in such a way that the batter is able to circle the bases and reach home safely

image here

image here

Base on balls

A single is when you hit the ball and get to first base

image here

image here

baseball home run, go all the way home and score a run

baseball steal a base

Image here

batting average equation

image here

Bases on Balls or Stolen Bases?

Let’s start looking at some baseball data and try to answer our questions using these data. First one: do teams that hit more home runs score more runs? We know what the answer to this will be, but let’s look at the data anyway. We’re going to examine data from 1961 to 2001. We end at 2001 because, remember, we’re back in 2002, getting ready to build a team.

We start in 1961 because that year, the league changed from 154 games to 162 games. The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot (So we can do what we prefer, I suppose). The following code shows you how to make that scatterplot. We start by loading the Lahman library that has all these baseball statistics, and then we simply make a scatterplot. Here’s a plot of runs per game versus home runs per game.

The plot shows a very strong association–teams with more home runs tended to score more runs. Now, let’s examine the relationship between stolen bases and wins. Here are the runs per game plotted against stolen bases per game. Here, the relationship is not as clear. Finally, let’s examine the relationship between bases on balls and runs. Here are runs per game versus bases on balls per game. Although the relationship is not as strong as it was for home runs, we do see a pretty strong relationship here.

We know that, by definition, home runs cause runs, because when you hit a home run, at least one run will score. (Need to study how baseball scores runs; are runs and scores the same thing in a baseball game???) Now, it could be that home runs also cause the bases on balls (How??? A base on balls is when the pitcher does not pitch well and you get to go to first base). If you understand the game, you will agree with me that that could be the case. So it might appear that a base on balls is causing runs, when in fact, it’s home runs that’s causing both. This is called confounding, an important concept you will learn about. Linear regression will help us parse all this out and quantify the associations. This will then help us determine what players to recruit. Specifically, we will try to predict things like how many more runs a team will score if we increase the number of bases on balls but keep the home runs fixed. Regression will help us answer this question as well.
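As a sketch of the kind of adjustment regression makes possible (this multivariable model is an assumption about where the course is headed, not code shown in the lecture):

```r
library(Lahman)
library(dplyr)

# Regress runs per game on BB per game while also including HR per game,
# so the BB coefficient is estimated with home runs held fixed
Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB / G, HR_per_game = HR / G, R_per_game = R / G) %>%
  lm(R_per_game ~ BB_per_game + HR_per_game, data = .) %>%
  coef()
```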

Textbook link

This video corresponds to the base on balls or stolen bases textbook section. https://rafalab.github.io/dsbook/linear-models.html#base-on-balls-or-stolen-bases

Key points

The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot.

Code: Scatterplot of the relationship between HRs and wins

library(Lahman)
library(tidyverse)
library(dslabs)
ds_theme_set()

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Code: Scatterplot of the relationship between stolen bases and wins

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
  ggplot(aes(SB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Code: Scatterplot of the relationship between bases on balls and runs

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
  ggplot(aes(BB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)
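The strength of these three relationships can also be summarized numerically; a small sketch using `cor` (correlation is only introduced formally in the next section, so this is a preview, not lecture code):

```r
library(Lahman)
library(dplyr)

# Correlations matching the three scatterplots above
Teams %>%
  filter(yearID %in% 1961:2001) %>%
  summarize(cor_HR = cor(HR / G, R / G),
            cor_SB = cor(SB / G, R / G),
            cor_BB = cor(BB / G, R / G))
```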

Image here

library(Lahman)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

#ggplot2::ds_theme_set()
Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_per_game = HR/G, R_per_game = R/G) %>%
  ggplot2::ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha=0.5)

Image here

Image here

image here

home runs cause both

image here


Assessment: Baseball as a Motivating Example

Comprehension Check due May 29, 2022 00:29 AWST Completed

Question 1

1/1 point (graded)
What is the application of statistics and data science to baseball called?

    Moneyball
    Sabermetrics
    The “Oakland A’s Approach”
    There is no specific name for this; it’s just data science.

Question 2

1/1 point (graded)
Which of the following outcomes is not included in the batting average?

    A home run
    A base on balls
    An out
    A single

Question 3

1/1 point (graded)
Why do we consider team statistics as well as individual player statistics?

    The success of any individual player also depends on the strength of their team.
    Team statistics can be easier to calculate.
    The ultimate goal of sabermetrics is to rank teams, not players.

Question 4

1.0/1.0 point (graded) You want to know whether teams with more at-bats per game have more runs per game. What R code below correctly makes a scatter plot for this relationship?

Teams %>% filter(yearID %in% 1961:2001 ) %>% ggplot(aes(AB, R)) + geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(AB_per_game, R_per_game)) + geom_point(alpha = 0.5)

Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(AB_per_game, R_per_game)) + geom_line()

Teams %>% filter(yearID %in% 1961:2001 ) %>% mutate(AB_per_game = AB/G, R_per_game = R/G) %>% ggplot(aes(R_per_game, AB_per_game)) + geom_point()

Question 5

1.0/1.0 point (graded) What does the variable “SOA” stand for in the Teams table?

Hint: make sure to use the help file (?Teams).

    sacrifice out
    slides or attempts
    strikeouts by pitchers
    accumulated singles

Question 6

1/1 point (graded)

Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. Make a scatterplot of runs per game versus at bats (AB) per game. Which of the following is true?

    There is no clear relationship between runs and at bats per game.
    As the number of at bats per game increases, the number of runs per game tends to increase.
    As the number of at bats per game increases, the number of runs per game tends to decrease.

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
  ggplot(aes(AB_per_game, R_per_game)) +
  geom_point(alpha = 0.5)

Question 7

0/1 point (graded)

Use the filtered Teams data frame from Question 6. Make a scatterplot of win rate (number of wins per game) versus number of fielding errors (E) per game. Which of the following is true?

    There is no relationship between win rate and errors per game.
    As the number of errors per game increases, the win rate tends to increase.
    As the number of errors per game increases, the win rate tends to decrease. (This is the answer.)

library(dplyr)   # this is for pipe %>%
library(ggplot2)
library(Lahman) 
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v purrr   0.3.4
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
Teams %>% filter(yearID %in% 1961:2001 ) %>% 
  mutate(number_of_wins_per_game = W/G, fielding_errors_per_game = E/G) %>%
  ggplot(aes(number_of_wins_per_game, fielding_errors_per_game)) + geom_point(alpha = 0.3) 

# Explanation

When you examine the scatterplot, you can see a clear trend towards decreased win rate with increasing number of errors per game (before, I was using a big scatter marker size). The following code can be used to make the scatterplot:

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(win_rate = W / G, E_per_game = E / G) %>%
  ggplot(aes(win_rate, E_per_game)) +
  geom_point(alpha = 0.5)

Teams %>% summarise(cor(W/G, E/G))
##   cor(W/G, E/G)
## 1    -0.2158873

Can we really call this a relationship??? Really???

Question 8

1/1 point (graded)

Use the filtered Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus doubles (X2B) per game. Which of the following is true?

    There is no clear relationship between doubles per game and triples per game.
    As the number of doubles per game increases, the number of triples per game tends to increase.
    As the number of doubles per game increases, the number of triples per game tends to decrease.

Teams %>% filter(yearID %in% 1961:2001) %>%
  mutate(triple_per_game = X3B/G, double_per_game = X2B/G) %>%
  ggplot(aes(triple_per_game, double_per_game)) +
  geom_point(alpha = 0.5)

Questions About Baseball as a Motivating Example?

Ask your questions or make your comments about Baseball as a Motivating Example here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).

Some reminders:

Search the discussion board before posting to see if someone else has asked the same thing before asking a new question.
Please be specific in the title and body of your post regarding which question you’re asking about, to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you’re not sure how to do this, there are instructions in a pinned post in the “general” discussion forum.

Course / Section 1: Introduction to Regression / 1.2: Correlation

library(HistData)
## Error in library(HistData) : there is no package called 'HistData'
install.packages("HistData")

library(HistData)
data("GaltonFamilies")

galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

galton_heights %>%
  summarise(mean(father), sd(father), mean(son), sd(son))
##   mean(father) sd(father) mean(son)  sd(son)
## 1     69.09888   2.546555  70.45475 2.557061

galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5)

Correlation

Up to now in this series, we have focused mainly on univariate variables. However, in data science applications it is very common to be interested in the relationship between two or more variables. (Google this topic and explore a case study.) We saw this in our baseball example, in which we were interested in the relationship, for example, between bases on balls and runs. We’ll come back to this example, but we introduce the concepts of correlation and regression using a simpler example.

We’ll create a data set with the heights of fathers and their first sons: the actual data Galton used to discover and define regression. So we have the father and son height data. Suppose we were to summarize these data. Since both distributions are well approximated by normal distributions, we can use the two averages and two standard deviations as summaries. Here they are.

However, this summary fails to describe a very important characteristic of the data that you can see in this figure. The trend that the taller the father, the taller the son is not described by the summary statistics of the average and the standard deviation. We will learn that the correlation coefficient is a summary of this trend. (Interesting how the instructor jumped into the topic of his teaching here.)

Textbook link

The corresponding textbook section is Case Study: is height hereditary? https://rafalab.github.io/dsbook/regression.html#case-study-is-height-hereditary

Key points

    Galton tried to predict sons' heights based on fathers' heights.
    The mean and standard errors are insufficient for describing an important characteristic of the data: the trend that the taller the father, the taller the son.
    The correlation coefficient is an informative summary of how two variables move together that can be used to predict one variable using the other.

Code

create the dataset

library(tidyverse)
library(HistData)
data("GaltonFamilies")

set.seed(1983)
galton_heights <- GaltonFamilies %>%
  filter(gender == "male") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

means and standard deviations

galton_heights %>% summarize(mean(father), sd(father), mean(son), sd(son))

scatterplot of father and son heights

galton_heights %>% ggplot(aes(father, son)) + geom_point(alpha = 0.5)

library(dplyr)
library(Lahman)
library(HistData)

data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == 'male') %>%
  select(father, childHeight) %>%
  rename(son=childHeight)


galton_heights %>%
  summarise(mean(father), sd(father), mean(son), sd(son))
##   mean(father) sd(father) mean(son)  sd(son)
## 1     69.09888   2.546555  70.45475 2.557061
galton_heights %>% 
  ggplot(aes(father, son)) +
  geom_point(alpha=0.5)

Correlation Coefficient

The correlation coefficient is defined for a list of pairs (x1, y1) through (xn, yn) with the following formula. Here, mu_x and mu_y are the averages of x and y, respectively, and sigma_x and sigma_y are the standard deviations. The Greek letter rho is commonly used in statistics books to denote this correlation. The reason is that rho is the Greek letter for r, the first letter of the word regression. Soon, we will learn about the connection between correlation and regression. To understand why this equation does, in fact, summarize how two variables move together, consider that the i-th entry of x is (x_i - mu_x)/sigma_x SDs away from the average. Similarly, the y_i, which is paired with the x_i, is (y_i - mu_y)/sigma_y SDs away from the average y.

If x and y are unrelated, then the product of these two quantities will be positive as often as it will be negative. It is positive when both are positive or both are negative, and negative when one is positive and the other is negative. This will average to about 0. The correlation is this average.

And therefore, unrelated variables will have a correlation of about 0 (Why??? Sorry, I didn’t get it). If instead the quantities vary together, then we are averaging mostly positive products, because they’re going to be either positive times positive or negative times negative, and we get a positive correlation. If they vary in opposite directions, we get a negative correlation.
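This can be checked with a quick simulation (a sketch, not from the lecture): for two independently generated variables, the average product of standardized values lands near 0.

```r
set.seed(1)
x <- rnorm(10000)
y <- rnorm(10000)   # generated independently of x

# Products of standardized values are positive about as often as negative,
# so their average (the correlation) is near 0
mean(scale(x) * scale(y))
cor(x, y)
```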

Another thing to know is that we can show mathematically that the correlation is always between negative 1 and 1. To see this, consider that we cannot have a higher correlation than when we compare a list to itself. That would be perfect correlation. In this case, the correlation is given by this equation, which we can show is equal to 1. A similar argument with x and its exact opposite, negative x, proves that the correlation has to be greater than or equal to negative 1. So it’s between minus 1 and 1.
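The two extreme cases are easy to verify in R (a quick sketch):

```r
x <- rnorm(100)
cor(x,  x)   # a list compared to itself: exactly 1
cor(x, -x)   # a list compared to its exact opposite: exactly -1
```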

To see what data look like for other values of rho, here are six examples of pairs with correlations ranging from negative 0.9 to 0.99. When the correlation is negative, we see that they go in opposite directions: as x increases, y decreases. When the correlation gets closer to either 1 or negative 1, we see the cloud of points getting thinner and thinner. When the correlation is 0, we just see a big circle of points.
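Pairs with any chosen correlation can be simulated from a bivariate normal distribution; a sketch using `MASS::mvrnorm` (the lecture's figure was likely generated along these lines, but its exact code is not shown, so treat this as an assumption):

```r
library(MASS)

set.seed(1)
rho <- 0.9
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # unit variances, correlation rho

xy <- mvrnorm(500, mu = c(0, 0), Sigma = Sigma)
cor(xy[, 1], xy[, 2])    # close to 0.9
plot(xy[, 1], xy[, 2])   # a thin, elongated cloud of points
```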

Textbook link

This video corresponds to the correlation coefficient section of the textbook. https://rafalab.github.io/dsbook/regression.html#the-correlation-coefficient

Key points

    The correlation coefficient is defined for a list of pairs

        (x_1, y_1), ..., (x_n, y_n)

    as the average of the products of the standardized values:

        rho = (1/n) * sum_i ((x_i - mu_x)/sigma_x) * ((y_i - mu_y)/sigma_y)

    The correlation coefficient essentially conveys how two variables move together.
    The correlation coefficient is always between -1 and 1.

Code

rho <- mean(scale(x) * scale(y))
galton_heights %>% summarize(r = cor(father, son)) %>% pull(r)

alt text here.

alt text here.

alt text here.

alt text here.

alt text here.

alt text here.

alt text here.

https://www.mathsisfun.com/data/standard-deviation.html

alt text here.

library(Lahman)
library(dplyr)
library(HistData)

data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

galton_heights %>% summarise(cor(father, son))
##   cor(father, son)
## 1        0.5007248

alt text here.

Sample Correlation is a Random Variable

Before we continue describing regression, let’s go over a reminder about random variability. In most data science applications, we do not observe the population, but rather a sample. As with the average and standard deviation, the sample correlation is the most commonly used estimate of the population correlation. This implies that the correlation we compute and use as a summary is a random variable (So what???). As an illustration, let’s assume that the 179 pairs of fathers and sons are our entire population. A less fortunate geneticist can only afford to take a random sample of 25 pairs. The sample correlation for this random sample can be computed using this code. Here, the variable R is the random variable.

We can run a Monte Carlo simulation to see the distribution of this random variable. Here, we recreate R 1,000 times and plot its histogram. We see that the expected value is the population correlation (the mean of these Rs is 0.5) and that it has a relatively high standard error relative to its size (SD 0.147). This is something to keep in mind when interpreting correlations: a sample correlation is a random variable, and it can have a pretty large standard error. Also note that because the sample correlation is an average of independent draws (independent? average? how?), the central limit theorem actually applies. Therefore, for a large enough sample size N, the distribution of these Rs is approximately normal.

The expected value, we know, is the population correlation. The standard deviation is somewhat more complex to derive, but the formula is shown in the video (it is the same expression used as the slope in the QQ-plot code below). In our example, N = 25 does not appear to be large enough to make the approximation a good one (we can identify a good approximation with a QQ-plot: if the normal approximation holds, the points fall close to the line), as we see in this QQ-plot.
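The transcript points at a formula (“this is the actual formula here”) that is not reproduced in these notes. The standard expression for the spread of the sample correlation of bivariate normal data, and the one used as the geom_abline slope in the QQ-plot code below, is:

```latex
% Approximate standard error of the sample correlation R for sample size N
\mathrm{SE}(R) \approx \sqrt{\frac{1-\rho^2}{N-2}}
```

With rho about 0.5 and N = 25 this gives roughly 0.18, the same order as the 0.147 observed in the simulation.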

[][Textbook link]

This video corresponds to the textbook section titled: Sample correlation is a random variable. https://rafalab.github.io/dsbook/regression.html#sample-correlation-is-a-random-variable

[][Key points]

    The correlation that we compute and use as a summary is a random variable.
    When interpreting correlations, it is important to remember that correlations derived from samples are estimates containing uncertainty.
    Because the sample correlation is an average of independent draws, the central limit theorem applies. 

Code

compute sample correlation

R <- sample_n(galton_heights, 25, replace = TRUE) %>%
  summarize(r = cor(father, son))
R

Monte Carlo simulation to show distribution of sample correlation

B <- 1000
N <- 25
R <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    summarize(r = cor(father, son)) %>%
    pull(r)
})
qplot(R, geom = "histogram", binwidth = 0.05, color = I("black"))

expected value and standard error

mean(R)
sd(R)

QQ-plot to evaluate whether N is large enough

data.frame(R) %>%
  ggplot(aes(sample = R)) +
  stat_qq() +
  geom_abline(intercept = mean(R), slope = sqrt((1 - mean(R)^2) / (N - 2)))

library(Lahman)
library(dplyr)
library(HistData)
library(stats)

data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)


set.seed(0)
R <- sample_n(galton_heights, 25, replace=TRUE) %>%
  summarize(cor(father, son))

R
##   cor(father, son)
## 1        0.5889351
library(dplyr)
library(ggplot2)

B <- 1000
N <- 25

R <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    summarize(r = cor(father, son)) %>% .$r
})
# ========================================================================================================
# The $ operator accesses a data frame column. In a magrittr pipe, the dot (.)
# stands for the object passed in, so `.$r` extracts column r from the
# summarize() result (equivalent to pull(r)).

data.frame(R) %>% 
  ggplot(aes(R)) + geom_histogram(binwidth=0.05, color='black')

mean(R)
## [1] 0.5040874
sd(R)
## [1] 0.1439084
library(Lahman)
library(dplyr)
library(HistData)
library(stats)

data("GaltonFamilies")
galton_heights <- GaltonFamilies %>%
  filter(childNum == 1 & gender == "male") %>%
  select(father, childHeight) %>%
  rename(son = childHeight)


set.seed(0)
R <- sample_n(galton_heights, 25, replace=TRUE) %>%
  summarize(mean(father), sd(father), mean(son), sd(son))

R
##   mean(father) sd(father) mean(son)  sd(son)
## 1       69.232   2.302303    70.632 2.107273

this implies that the correlation we compute and use as a summary is a random variable.png

because the sample correlation is an average of independent draws, the central limit theorem applies.png

[][Read this and think] Sample Correlation is a Random Variable.png

alt text here.

alt text here

Assessment: Correlation

Question 1

1/1 point (graded) While studying heredity, Francis Galton developed what important statistical concept?

    Standard deviation
    Normal distribution
    Correlation
    Probability

Question 2

1/1 point (graded) The correlation coefficient is a summary of what?

    The trend between two variables
    The dispersion of a variable
    The central tendency of a variable
    The distribution of a variable

correct

Question 3

1/1 point (graded)

Below is a scatter plot showing the relationship between two variables, x and y. Scatter plot of relationship between x (plotted on the x-axis) and y (plotted on the y-axis). y-axis values range from -3 to 3; x-axis values range from -3 to 3. Points are fairly well distributed in a tight band with a range from approximately (-2, 2) to (3, -3).

Question 3 image here. From this figure, the correlation between x and y appears to be about:

    -0.9
    -0.2
    0.9
    2

Question 4

1/1 point (graded)

Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, we now run our simulation with a sample size of 50. Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same?

    Increase
    Decrease
    Stay approximately the same

Question 5

1/1 point (graded)

Instead of running a Monte Carlo simulation with a sample size of 25 from the 179 father-son pairs described in the videos, we now run our simulation with a sample size of 50. Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approximately the same?

    Increase
    Decrease
    Stay approximately the same

Question 6

1/1 point (graded) If X and Y are completely independent, what do you expect the value of the correlation coefficient to be?

    -1
    -0.5
    0
    0.5
    1
    Not enough information to answer the question
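A quick sketch (my own, not course code) illustrating the idea behind this question: if X and Y are generated independently, the population correlation is 0, so sample correlations scatter around 0.

```r
# X and Y drawn independently: the population correlation is 0,
# so the 1000 sample correlations should center near 0.
set.seed(1)
r <- replicate(1000, {
  x <- rnorm(25)
  y <- rnorm(25)  # generated independently of x
  cor(x, y)
})
mean(r)  # close to 0
sd(r)    # close to sqrt(1 / (25 - 2)), matching the SE formula used earlier
```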

Question 7

1/1 point (graded)

Load the Lahman library. Filter the Teams data frame to include years from 1961 to 2001. What is the correlation coefficient between number of runs per game and number of at bats per game? correct 0.6580976


Question 8

1/1 point (graded)

Use the filtered Teams data frame from Question 7. What is the correlation coefficient between win rate (number of wins per game) and number of errors per game? correct -0.3396947


Question 9

1/1 point (graded)

Use the filtered Teams data frame from Question 7. What is the correlation coefficient between doubles (X2B) per game and triples (X3B) per game? correct -0.01157404


library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)

Teams %>% filter(yearID %in% 1961:2001 ) %>% 
  mutate(number_of_runs_per_game = R/G, number_of_bats_per_game = AB/G) %>%
  ggplot2::ggplot(aes(number_of_runs_per_game, number_of_bats_per_game)) + geom_point(alpha = 0.5) 

# https://stackoverflow.com/questions/60901319/r-language-registered-s3-method-overwritten-by-data-table
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)

Teams %>% filter(yearID %in% 1961:2001 ) %>% 
  summarize(cor(R/G, AB/G))
##   cor(R/G, AB/G)
## 1      0.6580976
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)

Teams %>% filter(yearID %in% 1961:2001 ) %>% 
  summarize(cor(W/G, E/G))
##   cor(W/G, E/G)
## 1    -0.3396947
library(Lahman)
library(dplyr)
library(ggplot2)
library(tidyverse)

Teams %>% filter(yearID %in% 1961:2001 ) %>% 
  summarize(cor(X2B/G, X3B/G))
##   cor(X2B/G, X3B/G)
## 1       -0.01157404

Questions About Correlation?

Ask your questions or make your comments about Correlation here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).

Some reminders:

    Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
    Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
    Posting snippets of code is okay, but posting full code solutions is not.
    If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.

Course / Section 1: Introduction to Regression / 1.3: Stratification and Variance Explained

Anscombe’s Quartet/Stratification

Correlation is not always a good summary of the relationship between two variables. A famous example used to illustrate this is the following set of four artificial data sets, referred to as Anscombe’s quartet. All of these pairs have a correlation of 0.82. Correlation is only meaningful in a particular context. To help us understand when correlation is meaningful as a summary statistic, we’ll try to predict the son’s height using the father’s height. This will help motivate and define linear regression. We start by demonstrating how correlation can be useful for prediction.

[][Suppose we are asked to guess the height of a randomly selected son]. Because the distribution of the son heights is approximately normal, we know that the average height of 70.5 inches is the value with the highest proportion and would be the prediction with the best chance of minimizing the error. [][But what if we are told that the father is 72 inches?] Do we still guess 70.5 inches for the son? The father is taller than average; specifically, he is 1.14 standard deviations taller than the average father. So shall we predict that the son is also 1.14 standard deviations taller than the average son? It turns out that this would be an overestimate.

To see this, we look at all the sons with fathers who are about 72 inches. We do this by stratifying the father’s height. We call this a conditional average, since we are computing the average son height conditioned on the father being 72 inches tall. A challenge when using this approach in practice is that we don’t have many fathers that are exactly 72. In our data set, we only have eight. If we change the number to 72.5, we would only have one father who is that height. This would result in averages with large standard errors, and they won’t be useful for prediction for this reason. But for now, what we’ll do is we’ll take an approach of creating strata of fathers with very similar heights. Specifically, we will round fathers’ heights to the nearest inch. This gives us the following prediction for the son of a father that is approximately 72 inches tall. We can use this code and get our answer, which is 71.84. This is 0.54 standard deviations larger than the average son, a smaller number than the 1.14 standard deviations taller that the father was above the average father. Stratification followed by box plots lets us see the distribution of each group. Here is that plot.

We can see that the centers of these groups are increasing with height, not surprisingly. The means of each group appear to follow a linear relationship (then why are we supposed to use a scatter plot to explore the relationship between two variables, as in the Exploratory Data Analysis in Python DataCamp course? With many dots covering each other, a boxplot per stratum avoids that overplotting). We can make that plot with this code. See the plot and notice that this appears to follow a line. The slope of this line appears to be about 0.5, which happens to be the correlation between father and son heights. This is not a coincidence. To see this connection, let’s plot the standardized heights (why standardize? can we just use the boxplot or the grouped mean?) against each other, son versus father, with a line that has a slope equal to the correlation. (Think, Think, Think) [][****Read the important comments in the standardized plot below, and think about how it is all related****]

Here’s the code. Here’s a plot. This line is what we call the regression line. In a later video, we will describe Galton’s theoretical justification for using this line to estimate conditional means. Here, we define it and compute it for the data at hand. The regression line for two variables, x and y, tells us that [][for every standard deviation (sigma x) that x increases above its average (mu x), y grows rho standard deviations (sigma y) above its average (mu y)]. The formula for the regression line is therefore this one (Think, Think, Think) [][How does this come out???]. If there’s perfect correlation, we predict an increase that is the same number of SDs. If there’s zero correlation, then we don’t use x at all for the prediction of y. For values between 0 and 1, the prediction is somewhere in between. If the correlation is negative, we predict a reduction instead of an increase.
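As a worked sketch of this prediction rule (using the rounded values quoted in the transcript: rho about 0.5 and a father 1.14 SDs above the average):

```r
# Regression prediction in standard units: z_son = rho * z_father.
rho <- 0.5
z_father <- 1.14              # father's height in standard units
z_son_pred <- rho * z_father  # predicted son's height in standard units
z_son_pred  # 0.57, close to the 0.54 SDs found by stratifying at 72 inches
```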

It is because, when the correlation is positive but lower than one, we predict something closer to the mean (this assumes an approximately bivariate normal distribution and a reasonably sized sample) that we call this regression. The son regresses to the average height.

In fact, the title of Galton’s paper was “Regression Towards Mediocrity in Hereditary Stature.” Note that if we write this in the standard form of a line, y equals b plus mx, where b is the intercept and m is the slope, the regression line has slope rho times sigma y, divided by sigma x, and intercept mu y, minus mu x, times the slope. So if we standardize the variables so that they have average 0 and standard deviation 1, then the regression line has intercept 0 and slope equal to the correlation rho. Let’s look at the original father-son data and add the regression line. We can compute the intercept and the slope using the formulas we just derived. Here’s the code to make the plot with the regression line. If we plot the data in standard units, then, as we discussed, the regression line has intercept 0 and slope rho. Here’s the code to make that plot.

We started this discussion by saying that we wanted to use the conditional means to predict the heights of the sons. But then we realized that there were very few data points in each stratum. When we did this approximation of rounding off the heights of the fathers (the boxplot), we found that these conditional means appear to follow a line. And we ended up with the regression line (which, if we standardize both variables, father and son, has intercept 0 and slope rho). ****So the regression line gives us the prediction****. An advantage of using the regression line is that we used all the data to estimate just two parameters, the slope and the intercept. This makes it much more stable. With conditional means, we had fewer data points, which made the estimates have large standard errors, and therefore be unstable. So the regression line gives us a much more stable prediction. However, are we justified in using the regression line to predict? Galton gives us the answer.

[][Textbook link]

There are three links to relevant sections of the textbook for this video:

    correlation is not always a useful summary
    https://rafalab.github.io/dsbook/regression.html#correlation-is-not-always-a-useful-summary
    conditional expectation
    https://rafalab.github.io/dsbook/regression.html#conditional-expectation
    the regression line
    https://rafalab.github.io/dsbook/regression.html#the-regression-line

[][Key points]

    Correlation is not always a good summary of the relationship between two variables.
    The general idea of conditional expectation is that we stratify a population into groups and compute summaries in each group.
    A practical way to improve the estimates of the conditional expectations is to define strata with similar values of x.
    If there is perfect correlation, the regression line predicts an increase that is the same number of SDs for both variables. If there is 0 correlation, then we don’t use x at all for the prediction and simply predict the average. For values between 0 and 1, the prediction is somewhere in between. If the correlation is negative, we predict a reduction instead of an increase.

Code

number of fathers with height 72 or 72.5 inches

sum(galton_heights$father == 72)
sum(galton_heights$father == 72.5)

predicted height of a son with a 72 inch tall father

conditional_avg <- galton_heights %>%
  filter(round(father) == 72) %>%
  summarize(avg = mean(son)) %>%
  pull(avg)
conditional_avg

stratify fathers’ heights to make a boxplot of son heights

galton_heights %>%
  mutate(father_strata = factor(round(father))) %>%
  ggplot(aes(father_strata, son)) +
  geom_boxplot() +
  geom_point()

center of each boxplot

galton_heights %>%
  mutate(father = round(father)) %>%
  group_by(father) %>%
  summarize(son_conditional_avg = mean(son)) %>%
  ggplot(aes(father, son_conditional_avg)) +
  geom_point()

calculate values to plot regression line on original data

mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y / s_x
b <- mu_y - m * mu_x

add regression line to plot

galton_heights %>%
  ggplot(aes(father, son)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = b, slope = m)

image here

image here

image here

image here

image here

conditional_avg <- galton_heights %>%
  filter(round(father)==72) %>%
  summarise(avg=mean(son)) %>% .$avg

conditional_avg
## [1] 71.83571
galton_heights %>%
  mutate(father_strata = factor(round(father))) %>%
  ggplot2::ggplot(aes(father_strata, son)) +
  geom_boxplot() +
  geom_point()

galton_heights %>%
  mutate(father=round(father)) %>%
  group_by(father) %>%
  summarise(son_conditional_avg = mean(son)) %>%
  ggplot2::ggplot(aes(father, son_conditional_avg)) +
  geom_point()

image here

r <- galton_heights %>%  summarise(r = cor(father, son)) %>% .$r

galton_heights %>%
  mutate(father=round(father)) %>%
  group_by(father) %>%
  summarise(son = mean(son)) %>%
  mutate(z_father = scale(father), z_son = scale(son)) %>%
  ggplot2::ggplot(aes(z_father, z_son)) +
  geom_point() +
  geom_abline(intercept = 0, slope = r)

# **Why do we use scale function in R?**
# When we want to scale the values in several columns of a data frame so that each column has a mean of 0 and a standard deviation of 1, we usually use the scale() function.  
r <- galton_heights %>%  summarise(r = cor(father, son)) %>% .$r

galton_heights %>%
  mutate(father=round(father)) %>%
  group_by(father) %>%
  summarise(son = mean(son)) %>%
  mutate(z_father = father, z_son = son) %>%
  ggplot2::ggplot(aes(z_father, z_son)) +
  geom_point() +
  geom_abline(intercept = 35.5, slope = r)

# So the reason for standardizing: the intercept is easy to define (otherwise we would have to guess until it is 35 or so), and the two plots look different, this one versus the standardized one. Recall how correlation is calculated, and how the scale() function standardizes the data: (x - mean(x)) / sd(x)

image here

a <- galton_heights %>%
  mutate(father=round(father)) %>%
  group_by(father) %>%
  summarise(son = mean(son)) %>%
  mutate(z_father = scale(father), z_son = scale(son))
# https://stackoverflow.com/questions/20256028/understanding-scale-in-r

a
## # A tibble: 15 x 4
##    father   son z_father[,1] z_son[,1]
##     <dbl> <dbl>        <dbl>     <dbl>
##  1     62  65.2       -1.70     -2.14 
##  2     64  68.1       -1.28     -1.00 
##  3     65  67.6       -1.06     -1.21 
##  4     66  69.2       -0.850    -0.566
##  5     67  70.0       -0.638    -0.262
##  6     68  69.2       -0.425    -0.572
##  7     69  71.2       -0.213     0.231
##  8     70  71.2        0         0.198
##  9     71  71.5        0.213     0.341
## 10     72  71.8        0.425     0.469
## 11     73  71.5        0.638     0.350
## 12     74  75.2        0.850     1.82 
## 13     75  71.2        1.06      0.204
## 14     76  73.5        1.28      1.13 
## 15     78  73.2        1.70      1.01

Read the below statement

image here

image here

image here

image here

image here

mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)

m <- r * s_y / s_x
b <- mu_y - m * mu_x
galton_heights %>%
  ggplot2::ggplot(aes(father, son)) +
  geom_point(alpha=0.3) +
  geom_abline(intercept = b, slope=m)

galton_heights %>%
  ggplot2::ggplot(aes(scale(father), scale(son))) +
  geom_point(alpha=0.3) +
  geom_abline(intercept = 0, slope = r)

image here

Bivariate Normal Distribution

Correlation and the regression line are widely used summary statistics. [][But they are often misused or misinterpreted]. Anscombe’s example provided toy data sets in which summarizing with a correlation would be a mistake. But we also see this in the media and in the scientific literature.

The main way we motivate the use of correlation involves what is called the bivariate normal distribution. When a pair of random variables is approximated by a bivariate normal distribution, the scatterplot looks like an oval, like an American football. It can be thin; that’s when the variables have high correlation. All the way up to a circle shape when they have no correlation. We saw some examples previously. Here they are again. A more technical way to define the bivariate normal distribution is the following. First, this distribution is defined for pairs (remember the father-son pairs we used earlier). So we have two variables, x and y, and they have paired values. They are bivariate normally distributed if the following happens.

If x is a normally distributed random variable, and y is also a normally distributed random variable–and for any grouping of x that we can define, say, with x being equal to some predetermined value, which we call here in this formula little x–then the y’s in that group are approximately normal as well. If this happens, then the pair is approximately bivariate normal. When we fix x in this way, we then refer to the resulting distribution of the y’s in the group–defined by setting x in this way–as the conditional distribution of y given x. (Remember, we did this before: fixing the father’s height at 72 inches, but the sample size was too small.) We write the notation like this for the conditional distribution and the conditional expectation. If we think the height data is well-approximated by the bivariate normal distribution, then we should see the normal approximation hold for each grouping.

Here, we stratify the son heights by the standardized father heights and see that the assumption appears to hold. Here’s the code that gives us the desired plot. Now, we come back to defining correlation. Galton showed–using mathematical statistics–that when two variables follow a bivariate normal distribution, then for any given x the expected value of the y in pairs for which x is set at that value is mu y plus rho times (x minus mu x) divided by sigma x, times sigma y. Note that this is a line with slope rho times sigma y divided by sigma x and intercept mu y minus the slope times mu x. And therefore, this is the same as the regression line we saw in a previous video. (Now you should understand the slope difference between the standardized and non-standardized variables in the father-son case from the previous chapter.) That can be written like this. So in summary, if our data is approximately bivariate normal, then the conditional expectation–which is the best prediction for y given that we know the value of x–is given by the regression line.
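Written out in symbols, the conditional expectation described verbally above is:

```latex
\mathrm{E}(Y \mid X = x)
  = \mu_Y + \rho\,\frac{x - \mu_X}{\sigma_X}\,\sigma_Y
  = \underbrace{\Bigl(\mu_Y - \rho\frac{\sigma_Y}{\sigma_X}\,\mu_X\Bigr)}_{\text{intercept } b}
  + \underbrace{\rho\,\frac{\sigma_Y}{\sigma_X}}_{\text{slope } m}\, x
```

which matches the slope m and intercept b computed in the earlier regression-line code.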

[][Textbook link]

This video corresponds to the textbook section on the bivariate normal distribution (advanced). https://rafalab.github.io/dsbook/regression.html#bivariate-normal-distribution-advanced

[][Key points]

    When a pair of random variables are approximated by the bivariate normal distribution, scatterplots look like ovals. They can be thin (high correlation) or circle-shaped (no correlation).
    When two variables follow a bivariate normal distribution, computing the regression line is equivalent to computing conditional expectations.
    We can obtain a much more stable estimate of the conditional expectation by finding the regression line and using it to make predictions.

Code

galton_heights %>%
  mutate(z_father = round((father - mean(father)) / sd(father))) %>%
  filter(z_father %in% -2:2) %>%
  ggplot() +
  stat_qq(aes(sample = son)) +
  facet_wrap(~ z_father)

image here

image here

image here

image here

images here

images here

image here

image here

image here

galton_heights %>%
  #mutate(z_father=round((father-mean(father))/sd(father))) %>%
  mutate(z_father=round(scale(father))) %>%  # Do you recall this
  filter(z_father %in% -2:2) %>%
  ggplot2::ggplot() +
  stat_qq(aes(sample=son)) + 
  facet_wrap(~z_father)

# What does a QQ plot show? 
# The purpose of the quantile-quantile (QQ) plot is to show if two data sets come from the same distribution. Plotting the first data set's quantiles along the x-axis and plotting the second data set's quantiles along the y-axis is how the plot is constructed.
# https://math.illinois.edu/system/files/inline-files/Proj9AY1516-report2.pdf

image here # Where does the above formula come from???

image here

image here

image here

image here

image here

Variance Explained

Correction (so this video was recorded a long time ago, how interesting)

The equation shown at 0:10 is for the standard deviation of the conditional distribution, not the variance. **The variance is the standard deviation squared**. See the notes below the video for more clarification. The theory we’ve been describing also tells us that the standard deviation of the conditional distribution described in a previous video is SD of Y given X equals sigma y times the square root of 1 minus rho squared (Correction: the equation shown at 0:10 is this standard deviation, not the variance). This is where statements like “x explains such and such percent of the variation in y” come from ([][Google this topic]). Note that the variance of y is sigma y squared; that’s where we start. If we condition on x, then the variance goes down to 1 minus rho squared, times sigma y squared [][this follows by squaring the standard deviation formula above]. So from there, we can compute how much the variance has gone down: it has gone down by rho squared times 100% [][because sigma y squared minus (1 minus rho squared) sigma y squared equals rho squared sigma y squared]. So the correlation and the amount of variance explained are related to each other. But it is important to remember that the variance explained statement only makes sense when the data is approximated by a bivariate normal distribution.

[][Read all the course material and thinking, then trying to answer your own questions, its a good way of learning, and thinking]

[][Textbook link]

This video corresponds to the textbook section on variance explained. https://rafalab.github.io/dsbook/regression.html#variance-explained

[][Key points]

    Conditioning on a random variable X can help to reduce variance of response variable Y.
    The standard deviation of the conditional distribution is

        SD(Y | X = x) = sigma_y * sqrt(1 - rho^2),

    which is smaller than the standard deviation without conditioning, sigma_y.
    Because variance is the standard deviation squared, the variance of the conditional distribution is:

        Var(Y | X = x) = sigma_y^2 * (1 - rho^2).

    In the statement "X explains such and such percent of the variability," the percent value refers to the variance. The variance decreases by rho^2 * 100 percent.
    The “variance explained” statement only makes sense when the data is approximated by a bivariate normal distribution.
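The key points above can be checked with a small simulation (my own sketch, not course code): construct bivariate normal pairs with a chosen rho and compare the overall variance of Y with its variance inside a thin stratum of X.

```r
# Simulate bivariate normal (X, Y) with sd(Y) = 1 and cor(X, Y) = rho,
# then check that Var(Y | X = x) is about (1 - rho^2) * Var(Y).
set.seed(1)
n <- 1e5
rho <- 0.5
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # this construction gives cor(x, y) = rho
var(y)                    # about 1 (sigma_y^2)
var(y[abs(x - 1) < 0.1])  # about 1 - rho^2 = 0.75 within a thin stratum of x
```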

image here # The equation shown is for the standard deviation, not the variance.

image here

image here

image here

image here

image here

There are Two Regression Lines

We computed a regression line to predict the son’s height from the father’s height. We used these calculations– here’s the code–to get the slope and the intercept. This gives us the function that the conditional expectation of y given x is 35.7 plus 0.5 times x.

So, what if we wanted to predict the father’s height based on the son’s? It is important to know that this is not determined by computing the inverse function of what we just saw, which would be this equation here. [][We need to compute the expected value of x given y]. This gives us another regression function altogether, with slope and intercept computed like this: the slope is r times s_x divided by s_y, and the intercept is mu_x minus the slope times mu_y. So now we get that the expected value of x given y, or the expected value of the father’s height given the son’s height, is equal to 34 plus 0.5 y, a different regression line.

So in summary, it’s important to remember that the regression line comes from computing expectations, and these give you two different lines, depending on if you compute the expectation of y given x or x given y.

[][Textbook link]

The link to the corresponding section of the textbook is warning: there are two regression lines. https://rafalab.github.io/dsbook/regression.html#warning-there-are-two-regression-lines

[][Key point] There are two different regression lines depending on whether we are taking the expectation of Y given X or taking the expectation of X given Y. Code

compute a regression line to predict the son’s height from the father’s height

mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1 * mu_x

compute a regression line to predict the father’s height from the son’s height

m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2 * mu_y

mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)  # the original notes mistakenly assigned this to x_y

r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y / s_x
b <- mu_y - m * mu_x

m
## [1] 0.5027904
b
## [1] 35.71249

Son predicts father: simply inverting the first regression line is wrong

compute expected value of x given y

m <- r * s_x/s_y
b <- mu_x - m*mu_y

m
## [1] 0.4986676
b
## [1] 33.96539

image here

two different lines depending on what you do

Assessment: Stratification and Variance Explained, Part 1

Question 1

Look at the figure below. question one image Scatter plot of son and father heights with son heights on the y-axis and father heights on the x-axis. There is also a regression line that runs from roughly (63,66) to (78,76). The dots on the plot are scattered around the line. The slope of the regression line in this figure is equal to what, in words?

    Slope = (correlation coefficient of son and father heights) * (standard deviation of sons’ heights / standard deviation of fathers’ heights)
    Slope = (correlation coefficient of son and father heights) * (standard deviation of fathers’ heights / standard deviation of sons’ heights)
    Slope = (correlation coefficient of son and father heights) / (standard deviation of sons’ heights * standard deviation of fathers’ heights)
    Slope = (mean height of fathers) - (correlation coefficient of son and father heights * mean height of sons)

Question 2

1 point possible (graded) Why does the regression line simplify to a line with intercept zero and slope rho when we standardize our x and y variables?

Try the simplification on your own first!

    When we standardize variables, both x and y will have a mean of one and a standard deviation of zero. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho * x_i.
    When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho * x_i.
    When we standardize variables, both x and y will have a mean of zero and a standard deviation of one. When you substitute this into the formula for the regression line, the terms cancel out until we have the following equation: y_i = rho + x_i.
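A sketch of the simplification, starting from the regression line formula used in the earlier videos:

```latex
y = \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x - \mu_x)
\quad\xrightarrow{\ \mu_x = \mu_y = 0,\ \sigma_x = \sigma_y = 1\ }\quad
y = \rho\, x
```

With the means set to zero the intercept term vanishes, and with both standard deviations equal to one the slope reduces to rho.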

Question 3

1 point possible (graded) What is a limitation of calculating conditional means?

Select ALL that apply.

    Each stratum we condition on (e.g., a specific father’s height) may not have many data points.
    Because there are limited data points for each stratum, our average values have large standard errors.
    Conditional means are less stable than a regression line.
    Conditional means are a useful theoretical tool but cannot be calculated.

Question 4

1/1 point (graded) A regression line is the best prediction of Y given we know the value of X when:

    X and Y follow a bivariate normal distribution.
    Both X and Y are normally distributed.
    Both X and Y have been standardized.
    There are at least 25 X-Y pairs.

Question 5

0/1 point (graded) Which one of the following scatterplots depicts an x and y distribution that is NOT well-approximated by the bivariate normal distribution?

I chose 3, but it was marked wrong. Why?

Explanation

The v-shaped distribution of points from the first plot means that the x and y variables do not follow a bivariate normal distribution.

When a pair of random variables is approximated by a bivariate normal, the scatter plot looks like an oval (as in the 2nd, 3rd, and 4th plots) - [][it is okay if the oval is very round (as in the 3rd plot) or long and thin (as in the 4th plot)].

(Question 5 images 1-4: the four candidate scatterplots.)

Question 6

0/1 point (graded)

We previously calculated that the correlation coefficient between fathers' and sons' heights is 0.5. Given this, what percent of the variation in sons' heights is explained by fathers' heights? 0% 25% 50% 75% — I chose 50%, which is incorrect. Think, think, think. Answer Incorrect: Try again. [][When two variables follow a bivariate normal distribution, the variation explained can be calculated as rho^2 x 100] (where does this come from???).
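
As for where rho^2 comes from: the residual variance around the regression line is (1 - rho^2) * var(y), so the share of var(y) removed by the fit is rho^2. A quick numerical check on simulated data (my own sketch, not course code):

```r
# Sketch: fraction of variance explained by the regression line equals r^2.
set.seed(42)
x <- rnorm(500)
y <- 0.5 * x + rnorm(500)

r <- cor(x, y)
m <- r * sd(y) / sd(x)       # regression slope
b <- mean(y) - m * mean(x)   # regression intercept
resid <- y - (b + m * x)

explained <- 1 - var(resid) / var(y)   # proportion of variance explained
all.equal(explained, r^2)              # TRUE: so with r = 0.5, it's 25%
```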

Question 7

1/1 point (graded)

Suppose the correlation between father and son's height is 0.5, the standard deviation of fathers' heights is 2 inches, and the standard deviation of sons' heights is 3 inches. Given a one inch increase in a father's height, what is the predicted change in the son's height? 0.333 0.5 0.667 0.75 1 1.5 correct Answer Correct: Correct! The slope of the regression line is calculated by multiplying the correlation coefficient by the ratio of the standard deviation of son heights and the standard deviation of father heights: var_son/var_father. (Note: here he means SD_son/SD_father; it's a mistake.)
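
The arithmetic behind this answer, using the numbers given in the question:

```r
# slope = rho * s_son / s_father
rho <- 0.5
s_father <- 2   # inches
s_son <- 3      # inches
slope <- rho * s_son / s_father
slope   # 0.75: predicted change in son's height per 1 inch of father's height
```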

Assessment: Stratification and Variance Explained, Part 2

Comprehension Check due May 29, 2022 00:29 AWST

In the second part of this assessment, you’ll analyze a set of mother and daughter heights, also from GaltonFamilies.

Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:

set.seed(1989) #if you are using R 3.5 or earlier
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
library(HistData)
data("GaltonFamilies")

female_heights <- GaltonFamilies %>%
    filter(gender == "female") %>%
    group_by(family) %>%
    sample_n(1) %>%
    ungroup() %>%
    select(mother, childHeight) %>%
    rename(daughter = childHeight)

set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
library(HistData)
data("GaltonFamilies")

female_heights <- GaltonFamilies %>%     
    filter(gender == "female") %>%     
    group_by(family) %>%     
    sample_n(1) %>%     
    ungroup() %>%     
    select(mother, childHeight) %>%     
    rename(daughter = childHeight)


mean(female_heights$mother)
## [1] 64.125
sd(female_heights$mother)
## [1] 2.289292
mean(female_heights$daughter)
## [1] 64.28011
sd(female_heights$daughter)
## [1] 2.39416
cor(female_heights$mother, female_heights$daughter)
## [1] 0.3245199
mu_x <- mean(female_heights$mother)
mu_y <- mean(female_heights$daughter)
s_x <- sd(female_heights$mother)
s_y <- sd(female_heights$daughter)

r <- cor(female_heights$mother, female_heights$daughter)

m <- r * s_y/s_x
b <- mu_y - m*mu_x

m
## [1] 0.3393856
b
## [1] 42.51701
m*m    # careful: variance explained is r*r (about 0.105), not the slope squared
## [1] 0.1151826
outcome <- m * 60 + b
outcome
## [1] 62.88015
***Recall what we did earlier in the course***
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)   # the original note assigned this to x_y by mistake

r <- cor(galton_heights$father, galton_heights$son)

m <- r * s_y/s_x    # slope: correlation times the ratio of SDs (not variances)
b <- mu_y - m*mu_x

Question 8

5/5 points (graded)

Calculate the mean and standard deviation of mothers' heights, the mean and standard deviation of daughters' heights, and the correlation coefficient between mother and daughter heights.

    Mean of mothers' heights: 64.125 (correct)
    Standard deviation of mothers' heights: 2.289292 (correct)
    Mean of daughters' heights: 64.28011 (correct)
    Standard deviation of daughters' heights: 2.39416 (correct)
    Correlation coefficient: 0.3245199 (correct)

You have used 1 of 10 attempts. Correct (5/5 points)

Question 9

3/3 points (graded)

Calculate the slope and intercept of the regression line predicting daughters' heights given mothers' heights. Given an increase in mother's height by 1 inch, how many inches is the daughter's height expected to change?

    Slope of regression line predicting daughters' height from mothers' heights: 0.3393856 (correct)
    Intercept of regression line predicting daughters' height from mothers' heights: 42.51701 (correct)
    Change in daughter's height in inches given a 1 inch increase in the mother's height: 0.3393856 (correct)

You have used 1 of 10 attempts. Correct (3/3 points)

Question 10

1/1 point (graded) What percent of the variability in daughter heights is explained by the mother's height?

Report your answer as a value between 0 and 100. Do NOT include the percent symbol (%) in your submission. Answer: 11 (correct)

You have used 4 of 10 attempts. Correct (1/1 point)

Question 11

1/1 point (graded)

A mother has a height of 60 inches. Using the regression formula, what is the conditional expected value of her daughter's height given the mother's height? Answer: 62.88015 (correct)

You have used 1 of 10 attempts. Correct (1/1 point)

Questions About Stratification and Variance Explained?

Ask your questions or make your comments about Stratification and Variance Explained here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).

Some reminders:

    Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
    Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
    Posting snippets of code is okay, but posting full code solutions is not.
    If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.

Course / Section 2: Linear Models / Linear Models Overview

Linear Models Overview

In the Linear Models section, you will learn how to do linear regression.

After completing this section, you will be able to:

    Use multivariate regression to adjust for confounders.
    Write linear models to describe the relationship between two or more variables.
    Calculate the least squares estimates for a regression model using the lm function.
    Understand the differences between tibbles and data frames.
    Use the do() function to bridge R functions and the tidyverse.
    Use the tidy(), glance(), and augment() functions from the broom package.
    Apply linear regression to measurement error models.

This section has four parts: Introduction to Linear Models, Least Squares Estimates, Tibbles, do, and broom, and Regression and Baseball. There are comprehension checks at the end of each part, along with an assessment on linear models at the end of the whole section for Verified learners only.

We encourage you to use R to interactively test out your answers and further your own learning. If you get stuck, we encourage you to search the discussion boards for the answer to your issue or ask us for help!

Course / Section 2: Linear Models / 2.1: Introduction to Linear Models

Confounding: Are BBs More Predictive?

In a previous video, we found that the slope of the regression line for predicting runs from bases on balls was 0.735. So, does this mean that if we go and hire low-salary players with many bases on balls, increasing the number of walks per game by 2 for our team, our team will score 1.47 more runs per game? [][We are again reminded that association is not causation]. The data do provide strong evidence that a team with 2 more bases on balls per game than the average team scores 1.47 more runs per game, but this does not mean that bases on balls are the cause. If we compute the regression line slope for singles, we get 0.449, a lower value. Note that a single gets you to first base just like a base on balls. Those who know a bit more about baseball will tell you that with a single, runners that are on base have a better chance of scoring than with a base on balls. (Did you see the logic conflict here?)

So, how can base on balls be more predictive of runs? The reason this happens is because of [][confounding]. Note the correlation between homeruns, bases on balls, and singles. We see that the correlation between bases on balls and homeruns is quite high compared to the other two pairs. [][It turns out that pitchers, afraid of homeruns, will sometimes avoid throwing strikes to homerun hitters]. As a result, homerun hitters tend to have more bases on balls. Thus, a team with many homeruns will also have more bases on balls than average, and as a result, it may appear that bases on balls cause runs. But it is actually the homeruns that caused the runs.

In this case, we say that bases on balls are confounded with homeruns. [][But could it be that bases on balls still help? To find out, we somehow have to adjust for the homerun effect. Regression can help with this]. =====================================================================================

library(dplyr )
library(Lahman)
library(ggplot2)
#library(tidyverse)
#library(dslabs)

Teams %>% 
  filter(yearID %in% 1961:2001) %>%
  mutate(Singles = (H - HR - X2B - X3B)/G, BB = BB/G, HR = HR/G) %>%
  summarize(cor(BB, HR), cor(Singles, HR), cor(BB, Singles))
##   cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1   0.4039313       -0.1737435      -0.05603822

Why? [][Did you see the logic conflict here??? The singles slope (0.449) is lower than the BB slope (0.735), yet a single gives runners on base a better chance of scoring. Now interesting.]++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Stratification and Multivariate Regression

[][To try to determine if bases on balls is still useful for creating runs, a first approach is to keep home runs fixed at a certain value and then examine the relationship between runs and bases on balls]. As we did when we stratified fathers by rounding to the closest inch, here, we can stratify home runs per game to the closest 10th. We filtered our strata with few points. We use this code to generate an informative data set. And then, we can make a scatter plot for each strata. A scatterplot of runs versus bases on balls. This is what it looks like. Remember that the regression slope for predicting runs with bases on balls when we ignore home runs was 0.735.

[][But once we stratify by home runs, these slopes are substantially reduced]. We can actually see what the slopes are by using this code. We stratify by home run and then compute the slope using the formula that we showed you previously. These values are closer to the slope we obtained from singles, which is 0.449. Which is more consistent with our intuition. Since both singles and bases on ball get us to first base, they should have about the same predictive power.

Now, although our understanding of the application– our understanding of baseball– tells us that home runs cause bases on balls and not the other way around, we can still check if, after stratifying by base on balls, we still see a home run effect or if it goes down. We use the same code that we just used for bases on balls. But now, we swap home runs for bases on balls to get this plot. In this case, the slopes are the following. You can see they are all around 1.5, 1.6, 1.7. So they do not change that much from the original slope estimate, which was 1.84.

[][Regardless, it seems that if we stratify by home runs, we have an approximately bivariate normal distribution for runs versus bases on balls. Similarly, if we stratify by bases on balls, we have an approximately normal bivariate distribution for runs versus home runs. So what do we do?] It is somewhat complex to be computing regression lines for each stratum. We're essentially fitting this model that you can see in this equation with the slopes for x1 changing for different values of x2 and vice versa. Here, x1 is bases on balls. And x2 are home runs. Is there an easier approach? (So we have to take the whole influence space into one equation???)

[][Note that if we take random variability into account, the estimated slopes by strata don't appear to change that much] (Sorry, what???). If these slopes are in fact the same, this implies that this function beta 1 of x2 and the other function beta 2 of x1 are actually constant. Which, in turn, implies that the expectation of runs conditioned on home runs and bases on balls can be written in this simpler model. This model implies that if the number of home runs is fixed, we observe a linear relationship between runs and bases on balls. And that the slope of that relationship does not depend on the number of home runs. Only the slope changes as the home runs increase. (Someone asked a question here) [][The statement "Only the slope changes as the home runs increase" is actually incorrect. We have since modified the script and re-filmed the video, though editing for the updated video has not been completed yet. The correct, updated text will be "Only the INTERCEPT changes". I will add a comment on the video page now until the updated video is uploaded.] Thinking and asking and answering.
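
The corrected statement ("only the intercept changes") can be checked with a toy version of the model E[R | BB, HR] = beta_0 + beta_1 * BB + beta_2 * HR. The coefficient values below are made up purely for illustration:

```r
# Toy check: in beta_0 + beta_1*BB + beta_2*HR, fixing HR shifts the intercept
# of runs vs. BB but leaves the BB slope unchanged.
beta_0 <- 2; beta_1 <- 0.5; beta_2 <- 1.5   # hypothetical coefficients
runs <- function(BB, HR) beta_0 + beta_1 * BB + beta_2 * HR

slope_at     <- function(h) runs(1, h) - runs(0, h)   # BB slope with HR fixed at h
intercept_at <- function(h) runs(0, h)                # intercept with HR fixed at h

all.equal(slope_at(0.5), slope_at(1.0))   # TRUE: slope does not depend on HR
c(intercept_at(0.5), intercept_at(1.0))   # 2.75 vs 3.5: only the intercept moves
```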

[][The same is true if we swap home runs and bases on balls. In this analysis, referred to as multivariate regression, we say that the bases on balls slope beta 1 is adjusted for the home run effect. If this model is correct, then confounding has been accounted for. But how do we estimate beta 1 and beta 2 from the data? For this, we’ll learn about linear models and least squares estimates.]

[][Textbook link]

This video corresponds to the textbook section on multivariate regression. https://rafalab.github.io/dsbook/linear-models.html#multivariate-regression

[][Key points]

    A first approach to check confounding is to keep HRs fixed at a certain value and then examine the relationship between BB and runs.
    The slopes of BB after stratifying on HR are reduced, but they are not 0, which indicates that BB are helpful for producing runs, just not as much as previously thought.
    NOTE: There is an error in the script. The quote "Only the slope changes as the home runs increase." will be corrected to "Only the INTERCEPT changes as the home runs increase." in a future version of the video.

Code

# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
    mutate(HR_strata = round(HR/G, 1), BB_per_game = BB/G, R_per_game = R/G) %>%
    filter(HR_strata >= 0.4 & HR_strata <= 1.2)

# scatterplot for each HR stratum
dat %>% ggplot(aes(BB_per_game, R_per_game)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm") +
    facet_wrap(~ HR_strata)

# calculate slope of regression line after stratifying by HR
dat %>%
    group_by(HR_strata) %>%
    summarize(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game))

# stratify by BB
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
    mutate(BB_strata = round(BB/G, 1), HR_per_game = HR/G, R_per_game = R/G) %>%
    filter(BB_strata >= 2.8 & BB_strata <= 3.9)

# scatterplot for each BB stratum
dat %>% ggplot(aes(HR_per_game, R_per_game)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm") +
    facet_wrap(~ BB_strata)

# slope of regression line after stratifying by BB
dat %>%
    group_by(BB_strata) %>%
    summarize(slope = cor(HR_per_game, R_per_game)*sd(R_per_game)/sd(HR_per_game))

dat <- Teams %>% 
  filter(yearID %in% 1961:2001) %>%
  mutate(HR_strata = round(HR/G, 1), 
         BB_per_game = BB/G, 
         R_per_game = R/G) %>%
  filter(HR_strata >= 0.4 & HR_strata <= 1.2)

head(dat)
##   yearID lgID teamID franchID divID Rank   G Ghome  W  L DivWin WCWin LgWin
## 1   1961   AL    BAL      BAL  <NA>    3 163    82 95 67   <NA>  <NA>     N
## 2   1961   AL    BOS      BOS  <NA>    6 163    82 76 86   <NA>  <NA>     N
## 3   1961   AL    CHA      CHW  <NA>    4 163    81 86 76   <NA>  <NA>     N
## 4   1961   NL    CHN      CHC  <NA>    7 156    78 64 90   <NA>  <NA>     N
## 5   1961   NL    CIN      CIN  <NA>    1 154    77 93 61   <NA>  <NA>     Y
## 6   1961   AL    CLE      CLE  <NA>    5 161    81 78 83   <NA>  <NA>     N
##   WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1     N 691 5481 1393 227  36 149 581  902  39 30  NA NA 588 526 3.22 54  21 33
## 2     N 729 5508 1401 251  37 112 647  847  56 36  NA NA 792 687 4.29 35   6 30
## 3     N 765 5556 1475 216  46 138 550  612 100 40  NA NA 726 653 4.06 39   3 33
## 4     N 689 5344 1364 238  51 176 539 1027  35 25  NA NA 800 689 4.48 34   6 25
## 5     N 710 5243 1414 247  35 158 423  761  70 33  NA NA 653 575 3.78 46  12 40
## 6     N 737 5609 1493 257  39 150 492  720  34 11  NA NA 752 665 4.15 35  12 23
##   IPouts   HA HRA BBA SOA   E  DP    FP              name              park
## 1   4413 1226 109 617 926 126 173 0.980 Baltimore Orioles  Memorial Stadium
## 2   4326 1472 167 679 831 143 140 0.977    Boston Red Sox    Fenway Park II
## 3   4344 1491 158 498 814 128 138 0.980 Chicago White Sox     Comiskey Park
## 4   4155 1492 165 465 755 183 175 0.970      Chicago Cubs     Wrigley Field
## 5   4110 1300 147 500 829 134 124 0.977   Cincinnati Reds     Crosley Field
## 6   4329 1426 178 599 801 139 142 0.977 Cleveland Indians Cleveland Stadium
##   attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro HR_strata BB_per_game
## 1     951089  96  96      BAL            BAL         BAL       0.9    3.564417
## 2     850589 102 103      BOS            BOS         BOS       0.7    3.969325
## 3    1146019  99  97      CHW            CHA         CHA       0.8    3.374233
## 4     673057 101 104      CHC            CHN         CHN       1.1    3.455128
## 5    1117603 102 101      CIN            CIN         CIN       1.0    2.746753
## 6     725547  97  98      CLE            CLE         CLE       0.9    3.055901
##   R_per_game
## 1   4.239264
## 2   4.472393
## 3   4.693252
## 4   4.416667
## 5   4.610390
## 6   4.577640
dat %>%
  ggplot2::ggplot(aes(BB_per_game, R_per_game)) +
  geom_point(alpha=0.3) +
  geom_smooth(formula = y ~ x, method = "lm") +
  facet_wrap(~HR_strata)

dat %>%
  group_by(HR_strata) %>%
  summarise(slope = cor(BB_per_game, R_per_game) * sd(R_per_game) / sd(BB_per_game))
## # A tibble: 9 x 2
##   HR_strata slope
##       <dbl> <dbl>
## 1       0.4 0.734
## 2       0.5 0.566
## 3       0.6 0.412
## 4       0.7 0.285
## 5       0.8 0.365
## 6       0.9 0.261
## 7       1   0.512
## 8       1.1 0.454
## 9       1.2 0.440
dat <- Teams %>%
  filter(yearID %in% 1961:2001) %>%
  mutate(BB_strata = round(BB/G, 1), 
         HR_per_game = HR/G, 
         R_per_game = R/G) %>%
  filter(BB_strata >= 2.8 & BB_strata <= 3.9)

dat %>%
  ggplot(aes(HR_per_game, R_per_game)) +
  geom_point(alpha=0.3) +
  geom_smooth(formula = y~x, method = "lm") +
  facet_wrap(~BB_strata)

dat %>%
  group_by(BB_strata) %>%
  summarise(slope = cor(HR_per_game, R_per_game) * sd(R_per_game) / sd(HR_per_game))
## # A tibble: 12 x 2
##    BB_strata slope
##        <dbl> <dbl>
##  1       2.8  1.52
##  2       2.9  1.57
##  3       3    1.52
##  4       3.1  1.49
##  5       3.2  1.58
##  6       3.3  1.56
##  7       3.4  1.48
##  8       3.5  1.63
##  9       3.6  1.83
## 10       3.7  1.45
## 11       3.8  1.70
## 12       3.9  1.30

Linear Models

Since Galton's original development, regression has become one of the most widely used tools in data science. One reason for this has to do with the fact that [][regression permits us to find relationships between two variables while adjusting for others], as we have just shown for bases on balls and home runs. This has been particularly popular in fields where randomized experiments are hard to run (Think here: can we think of other cases??? And how can we apply it to our own case of analyzing app users???), such as economics and epidemiology. When we're not able to randomly assign each individual to a treatment or control group, confounding is particularly prevalent.

For example, consider estimating the effect of fast food on life expectancy using data collected from a random sample of people in some jurisdiction. Fast food consumers are more likely to be smokers, drinkers, and have lower incomes (Your customer groups are already limited to a narrow condition). Therefore, a naive regression model may lead to an overestimate of the negative health effect of fast food. So how do we adjust for confounding in practice? We can use regression. [][We have described how, if data is bivariate normal, then the conditional expectation follows a regression line]; the conditional expectation being a line is not an extra assumption, but rather a result derived from the assumption that the variables are approximately bivariate normal.

[][However, in practice it is common to explicitly write down a model that describes the relationship between two or more variables using what is called a linear model]. We know that linear here does not refer to lines exclusively, but rather to the fact that the conditional expectation is a linear combination of known quantities. Any combination that multiplies them by a constant and then adds them up with, perhaps, a shift. For example, 2 plus 3x minus 4y plus 5z is a linear combination of x, y, and z. So beta 0 plus beta 1 x1 plus beta 2 x2 is a linear combination of x1 and x2. The simplest linear model is a constant beta 0. The second simplest is a line, beta 0 plus beta 1 x.

For Galton's data, we would denote n observed fathers' heights with x1 through xn. Then we model the n son heights we are trying to predict with the following model. Here, the little xi's are the fathers' heights, which are fixed, not random, due to the conditioning. We've conditioned on these values. And then big Yi is the random son's height that we want to predict. We further assume that the errors, denoted with the Greek letter epsilon, epsilon i, are independent from each other, have expected value 0, and that the standard deviation, which is usually called sigma, does not depend on i. It's the same for every individual. We know the xi, but [][to have a useful model for prediction, we need beta 0 and beta 1. We estimate these from the data]. Once we do, we can predict the sons' heights from any father's height, x. Note that if we further assume that the epsilons are normally distributed, then this model is exactly the same one we derived earlier for the bivariate normal distribution. A somewhat nuanced difference is that in the first approach, we assumed the data was bivariate normal, and [][the linear model was derived, not assumed].

In practice, linear models are just assumed without necessarily assuming normality. The distribution of the epsilons is not specified. But nevertheless, if your data is bivariate normal, the linear model that we just showed holds. If your data is not bivariate normal, then you will need to have other ways of justifying the model ([][Here, what are the other ways to justify the model???]). One reason linear models are popular is that they are interpretable. In the case of Galton's data, we can interpret the data like this. Due to inherited genes, the son's height prediction grows by beta 1 for each inch we increase the father's height x. Because not all sons with fathers of height x are of equal height, we need the term epsilon, which explains the remaining variability. This remaining variability includes the mother's genetic effect, environmental factors, and other biological randomness.

Note that given how we wrote the model, the intercept beta 0 is not very interpretable, as it is the predicted height of a son with a father with no height. Due to regression to the mean, the prediction will usually be a bit larger than 0, which is really not very interpretable. To make the intercept parameter more interpretable, we can rewrite the model slightly in the following way. Here, we have changed xi to xi minus the average height x bar. We have centered our covariate xi. In this case, beta 0, the intercept, would be the predicted height for the average father for the case where xi equals x bar.
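
A quick sketch (on simulated father-son heights, not Galton's data) showing that after centering the covariate, the least squares intercept is exactly the mean of the outcome, i.e. the predicted height for a son of an average-height father:

```r
# Sketch: with a centered covariate, the LSE intercept equals mean(y).
set.seed(7)
father <- rnorm(100, mean = 69, sd = 2)
son    <- 35 + 0.5 * father + rnorm(100, sd = 2)

fit <- lm(son ~ I(father - mean(father)))
unname(coef(fit)[1])   # intercept...
mean(son)              # ...equals the mean son height
```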

quotation form the book: Note that if we further assume that the ε is normally distributed, then this model is exactly the same one we derived earlier by assuming bivariate normal data. A somewhat nuanced difference is that in the first approach we assumed the data was bivariate normal and that the linear model was derived, not assumed. In practice, linear models are just assumed without necessarily assuming normality: the distribution of the εs is not specified. Nevertheless, if your data is bivariate normal, the above linear model holds. If your data is not bivariate normal, then you will need to have other ways of justifying the model.

[][Textbook link]

This video corresponds to the textbook section on linear models. https://rafalab.github.io/dsbook/linear-models.html

[][Key points]

    “Linear” here does not refer to lines, but rather to the fact that the conditional expectation is a linear combination of known quantities.

[][In Galton's model, we assume Y (son's height) is a linear combination of a constant and X (father's height) plus random noise. We further assume that the Epsilon_i are independent from each other, have expected value 0, and a standard deviation Sigma which does not depend on i.] Note that if we further assume that Epsilon is normally distributed, then the model is exactly the same one we derived earlier by assuming bivariate normal data. We can subtract the mean from X to make Beta_0 more interpretable.

(Figure notes: a linear combination of x, y, and z; a horizontal line, the simplest linear model; the second simplest linear model has a slope; the random son heights we want to predict, conditioned on fathers' heights; the father-son heights dataset.)

So smart

Assessment: Introduction to Linear Models

Question 1

1/1 point (graded) As described in the videos, when we stratified our regression lines for runs per game vs. bases on balls by the number of home runs, what happened?

    The slope of runs per game vs. bases on balls within each stratum was reduced because we removed confounding by home runs. (correct)
    The slope of runs per game vs. bases on balls within each stratum was reduced because there were fewer data points.
    The slope of runs per game vs. bases on balls within each stratum increased after we removed confounding by home runs.
    The slope of runs per game vs. bases on balls within each stratum stayed about the same as the original slope.

Answer Correct: Correct.

Question 2

1/1 point (graded)

We run a linear model for sons’ heights vs. fathers’ heights using the Galton height data, and get the following results:

   > lm(son ~ father, data = galton_heights)

Call: lm(formula = son ~ father, data = galton_heights)

Coefficients:
(Intercept)       father
      35.71         0.50

Interpret the numeric coefficient for "father."

    For every inch we increase the son's height, the predicted father's height increases by 0.5 inches.
    For every inch we increase the father's height, the predicted son's height grows by 0.5 inches. (correct)
    For every inch we increase the father's height, the predicted son's height is 0.5 times greater.

Explanation

The coefficient for “father” gives the predicted increase in son’s height for each increase of 1 unit in the father’s height. In this case, it means that for every inch we increase the father’s height, the son’s predicted height increases by 0.5 inches.

Question 3

1/1 point (graded)

We want the intercept term for our model to be more interpretable, so we run the same model as before but now we subtract the mean of fathers’ heights from each individual father’s height to create a new variable centered at zero.

galton_heights <- galton_heights %>%
    mutate(father_centered = father - mean(father))

We run a linear model using this centered fathers’ height variable.

  > lm(son ~ father_centered, data = galton_heights)

Call: lm(formula = son ~ father_centered, data = galton_heights)

Coefficients:
    (Intercept)  father_centered
          70.45             0.50

Interpret the numeric coefficient for the intercept.

    The height of a son of a father of average height is 70.45 inches. (correct)
    The height of a son when a father's height is zero is 70.45 inches.
    The height of an average father is 70.45 inches.

Explanation

Because the fathers’ heights (the independent variable) have been centered on their mean, the intercept represents the height of the son of a father of average height. In this case, that means that the height of a son of a father of average height is 70.45 inches.

If we had not centered fathers’ heights to its mean, then the intercept would represent the height of a son when a father’s height is zero.

Question 4

1/1 point (graded)

Suppose we fit a multivariable regression model for expected runs based on BB and HR:

E[R|BB = x_1, HR = x_2] = Beta_0 + Beta_1 * x_1 + Beta_2 * x_2

Suppose we fix BB = x_1. Then we observe a linear relationship between runs and HR with intercept of:

    Beta_0
    Beta_0 + Beta_2 * x_2
    Beta_0 + Beta_1 * x_1 (correct)
    Beta_0 + Beta_2 * x_1

Explanation

If BB = x_1 is fixed, then Beta_0 + Beta_1 * x_1 is fixed and acts as the intercept for this regression model. This is the basis of stratification.

Question 5

0.67/1 point (graded) Which of the following are assumptions for the errors Epsilon in a linear regression model?

Check ALL correct answers.

    The Epsilon are independent of each other (correct)
    The Epsilon have expected value 0 (correct)
    The variance of Epsilon is a constant (correct)

partially correct

lm(son ~ father, data = galton_heights)
## 
## Call:
## lm(formula = son ~ father, data = galton_heights)
## 
## Coefficients:
## (Intercept)       father  
##     35.7125       0.5028
lm(formula = son ~ father, data = galton_heights)
## 
## Call:
## lm(formula = son ~ father, data = galton_heights)
## 
## Coefficients:
## (Intercept)       father  
##     35.7125       0.5028

Summary: R linear regression uses the lm() function to create a regression model given some formula, in the form of Y~X+X2. To look at the model, you use the summary() function.
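
A minimal, self-contained illustration of that pattern, using the built-in mtcars data as a stand-in (any data frame works the same way):

```r
# Fit Y ~ X + X2 with lm() and inspect the fit with summary().
fit <- lm(mpg ~ wt + hp, data = mtcars)   # mtcars ships with base R
summary(fit)$coefficients                  # estimates, SEs, t values, p values
coef(fit)                                  # just the three estimates
```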

library(dplyr)


galton_heights <- galton_heights %>%
    mutate(father_centered=father - mean(father))

lm(son ~ father_centered, data = galton_heights)
## 
## Call:
## lm(formula = son ~ father_centered, data = galton_heights)
## 
## Coefficients:
##     (Intercept)  father_centered  
##         70.4547           0.5028

Questions About Introduction to Linear Models

Ask your questions or make your comments about Introduction to Linear Models here! Remember, one of the best ways to reinforce your own learning is by explaining something to someone else, so we encourage you to answer each other’s questions (without giving away the answers, of course).

Some reminders:

    Search the discussion board before posting to see if someone else has asked the same thing before asking a new question
    Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
    Posting snippets of code is okay, but posting full code solutions is not.
    If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.

Course / Section 2: Linear Models / 2.2: Least Squares Estimates

Least Squares Estimates (LSE)

For linear models to be useful, we have to estimate the unknown parameters, the betas. The standard approach in science is to find the values that minimize the distance of the fitted model to the data. To quantify this, we use the least squares equation. For Galton’s data, we would write something like this. This quantity is called the residual sum of squares (RSS). Once we find the values that minimize the RSS, we call them the least squares estimates (LSE) and denote them, in this case, with beta 0 hat and beta 1 hat.

Let’s write the function that computes the RSS for any pair of values, beta 0 and beta 1, for our heights data. It would look like this. So for any pair of values, we get an RSS. This is a three-dimensional plot, with beta 0 and beta 1 as x and y and the RSS as z. To find the minimum, you would have to look at this three-dimensional plot. Here, we’re just going to make a two-dimensional version by keeping beta 0 fixed at 25, so the RSS becomes a function of beta 1 alone. We can use this code to produce this plot. We can see a clear minimum for beta 1 at around 0.65. So you can see how we would pick the least squares estimates. However, this minimum is for beta 1 when beta 0 is fixed at 25. We don’t know if that’s the minimum for beta 0; we don’t know if (25, 0.65) minimizes the equation across all pairs.

We could use trial and error, but it’s really not going to work here. Instead we will use calculus: we’ll take the partial derivatives, set them equal to 0, and solve for beta 1 and beta 0. Of course, if we have many parameters, these equations can get rather complex. But there are functions in R that do these calculations for us. We will learn these soon. To learn the mathematics behind this, you can consult a book on linear models.
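For simple regression, the calculus gives a well-known closed form: the slope estimate is the correlation times the ratio of standard deviations, and the intercept makes the line pass through the means. A minimal sketch on synthetic stand-in data (not Galton’s):

```r
set.seed(1)
x <- rnorm(100, mean = 69, sd = 2)         # stand-in for father heights
y <- 35 + 0.5 * x + rnorm(100, sd = 2)     # stand-in for son heights

beta1_hat <- cor(x, y) * sd(y) / sd(x)     # slope from sample moments
beta0_hat <- mean(y) - beta1_hat * mean(x) # line passes through the means

# lm() minimizes the same RSS, so the estimates agree:
rbind(closed_form = c(beta0_hat, beta1_hat), lm = coef(lm(y ~ x)))
```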

A quotation from the book: In mathematics, when we multiply each variable by a constant and then add them together, we say we formed a linear combination of the variables. For example, 3x−4y+5z is a linear combination of x, y, and z. We can also add a constant, so 2+3x−4y+5z is also a linear combination of x, y, and z.

Textbook link

This video corresponds to the textbook section on least squares estimates. https://rafalab.github.io/dsbook/linear-models.html#lse

Key points

    For regression, we aim to find the coefficient values that minimize the distance of the fitted model to the data.
    Residual sum of squares (RSS) measures the distance between the true value and the predicted value given by the regression line. The values that minimize the RSS are called the least squares estimates (LSE).
    We can use partial derivatives to get the values for beta 0 and beta 1 in Galton's data.
    NOTE: At timepoint 0:57 in the video, the Professor uses the terms beta 1 and beta 2, but these should be beta 0 and beta 1.

Code

compute RSS for any pair of beta0 and beta1 in Galton’s data

library(HistData)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
  filter(gender == "male") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  select(father, childHeight) %>%
  rename(son = childHeight)

rss <- function(beta0, beta1){
  resid <- galton_heights$son - (beta0 + beta1*galton_heights$father)
  return(sum(resid^2))
}

plot RSS as a function of beta1 when beta0=25

beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
                      rss = sapply(beta1, rss, beta0 = 25))
results %>% ggplot(aes(beta1, rss)) + geom_line() +
  geom_line(aes(beta1, rss), col=2)

rss <- function(beta0, beta1, data){
  resid <- galton_heights$son - (beta0 + beta1 * galton_heights$father)
  return(sum(resid^2))
}


beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1, 
                      rss = sapply(beta1, rss, beta0=25))

results %>% ggplot2::ggplot(aes(beta1, rss)) + geom_line() +
  geom_line(aes(beta1, rss), col=2)

Definition of trial and error: finding the best way to reach a desired result or a correct solution by trying out one or more ways or means and noting and eliminating errors or causes of failure; also, the trying of one thing or another until something succeeds.

The lm Function

In R, we can obtain the least squares estimates using the lm function. To fit the model where Yi is the son’s height and Xi is the father’s height, we would write the following piece of code. This gives us the least squares estimates, which we can see in the output of R. The general way we use lm is with the tilde (~) character: the value we’re predicting goes on the left side of the tilde, and the variables we use to predict go on the right side. The intercept is added automatically to the model, so you don’t have to include it when you write the formula.

The object fit that we just computed includes more information about the least squares fit. We can use the function summary to extract more of this information, like this. To understand some of the information included in this summary, we need to remember that the LSE are random variables. Mathematical statistics gives us some idea of the distribution of these random variables, and we’ll learn some of that next.

Textbook link

This video corresponds to the textbook section on the lm function. https://rafalab.github.io/dsbook/linear-models.html#the-lm-function

Key points

    When calling the lm() function, the variable that we want to predict goes to the left of the ~ symbol, and the variables that we use to predict go to the right of the ~ symbol. The intercept is added automatically.
    LSEs are random variables.

Code

fit regression line to predict son’s height from father’s height

fit <- lm(son ~ father, data = galton_heights) fit

summary statistics

summary(fit)

fit <- lm(son ~ father, data = galton_heights)

fit
## 
## Call:
## lm(formula = son ~ father, data = galton_heights)
## 
## Coefficients:
## (Intercept)       father  
##     35.7125       0.5028
summary(fit)
## 
## Call:
## lm(formula = son ~ father, data = galton_heights)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9022 -1.4050  0.0922  1.3422  8.0922 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.71249    4.51737   7.906 2.75e-13 ***
## father       0.50279    0.06533   7.696 9.47e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.22 on 177 degrees of freedom
## Multiple R-squared:  0.2507, Adjusted R-squared:  0.2465 
## F-statistic: 59.23 on 1 and 177 DF,  p-value: 9.473e-13

LSE are Random Variables

The LSE are derived from the data, Y1 through Yn, which are random. This implies that our estimates are random variables. To see this, we can run a Monte Carlo simulation in which we assume that the son and father height data that we have define an entire population. We take random samples of size 50 and compute the regression slope coefficient for each one. The following code gives us several estimates of the regression slope.

We can see the variability of the estimates by plotting their distribution. Here you can see the histograms of the estimated beta 0’s and the estimated beta 1’s. These look normal because the central limit theorem applies here as well: for large enough N, the least squares estimates will be approximately normal with expected values beta 0 and beta 1, respectively.

The standard errors are a bit complicated to compute, but mathematical theory allows us to compute them, and they are included in the summary provided by the lm function. Here are the estimated standard errors for one of our simulated data sets; you can see them in the second column of the coefficients table. The standard error estimates reported by the summary function are close to the standard errors we obtain from the Monte Carlo simulation.

The summary function also reports t-statistics (the t value column) and p-values (the Pr(>|t|) column). The t-statistic is not actually based on the central limit theorem, but rather on the assumption that the epsilons follow a normal distribution. Under this assumption, mathematical theory tells us that the LSE divided by their standard error follows a t distribution with N minus p degrees of freedom, with p the number of parameters in our model, which in this case is 2.

The two p-values are testing the null hypotheses that beta 0 is 0 and beta 1 is 0, respectively.
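Those columns can be reproduced by hand from the summary table. A sketch on synthetic stand-in data (not the course data):

```r
set.seed(3)
x <- rnorm(50, 69, 2)
y <- 35 + 0.5 * x + rnorm(50, sd = 2)
fit <- lm(y ~ x)

co <- summary(fit)$coefficients
t_stat <- co[, "Estimate"] / co[, "Std. Error"]       # the "t value" column
p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)  # two-sided p-values

# Both match what summary() reports:
all.equal(unname(t_stat), unname(co[, "t value"]))
all.equal(unname(p_val),  unname(co[, "Pr(>|t|)"]))
```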

Further reading on the null hypothesis: https://www.statisticshowto.com/probability-and-statistics/null-hypothesis/

Note that, as we described previously, for large enough N the central limit theorem works and the t distribution becomes almost the same as a normal distribution. So whether you assume the errors are normal and use the t distribution, or assume that N is large enough to use the central limit theorem, you can construct confidence intervals for your parameters.
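In R, confint() builds exactly these t-based intervals. A sketch on synthetic stand-in data:

```r
set.seed(4)
x <- rnorm(50, 69, 2)
y <- 35 + 0.5 * x + rnorm(50, sd = 2)
fit <- lm(y ~ x)

# 95% t-based confidence intervals: estimate +/- qt(0.975, df) * SE
ci <- confint(fit, level = 0.95)
se <- summary(fit)$coefficients[, "Std. Error"]
upper <- coef(fit) + qt(0.975, df = fit$df.residual) * se
cbind(ci, upper_by_hand = upper)   # third column matches the "97.5 %" column
```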

Although we will not show examples in this video, hypothesis testing for regression models is very commonly used in, for example, epidemiology and economics, to make statements such as "the effect of A and B was statistically significant after adjusting for X, Y, and Z." But it is very important to note that several assumptions (we just described some of them) have to hold for these statements to hold.

[][Textbook link]

This video corresponds to the textbook section on LSE. https://rafalab.github.io/dsbook/linear-models.html#lse-are-random-variables

Key points

    Because they are derived from the samples, LSE are random variables.
    Beta_0 and Beta_1 appear to be normally distributed because the central limit theorem applies.
    The t-statistic depends on the assumption that the errors Epsilon follow a normal distribution.

Code

Monte Carlo simulation

B <- 1000
N <- 50
lse <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    lm(son ~ father, data = .) %>%
    .$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])

Plot the distribution of beta_0 and beta_1

library(gridExtra)
p1 <- lse %>% ggplot(aes(beta_0)) + geom_histogram(binwidth = 5, color = "black")
p2 <- lse %>% ggplot(aes(beta_1)) + geom_histogram(binwidth = 0.1, color = "black")
grid.arrange(p1, p2, ncol = 2)

summary statistics

sample_n(galton_heights, N, replace = TRUE) %>% lm(son ~ father, data = .) %>% summary %>% .$coef

lse %>% summarize(se_0 = sd(beta_0), se_1 = sd(beta_1))

library(dplyr)


B <- 1000
N <- 50

lse <- replicate(B, {
  sample_n(galton_heights, N, replace = TRUE) %>%
    lm(son ~ father, data = .) %>%
    .$coef
})


lse <- data.frame(beta_0 = lse[1, ], beta_1 = lse[2, ])

lse
##         beta_0    beta_1
## 1    24.761235 0.6652013
## 2    23.408611 0.6754664
## 3    39.590956 0.4418398
## 4    40.257861 0.4370058
## 5    37.769260 0.4791412
## 6    48.852680 0.3141427
## ... (remaining rows omitted; lse has 1000 rows, one per simulated sample)
## 808  43.585856 0.3884028
## 809  35.236584 0.5124872
## 810  27.082486 0.6213682
## 811  36.442577 0.4999627
## 812  37.847521 0.4731392
## 813  20.216795 0.7275523
## 814  37.725849 0.4661805
## 815  42.730203 0.4123664
## 816  27.126785 0.6314131
## 817  29.621837 0.5934311
## 818  51.252955 0.2743294
## 819  31.208000 0.5637825
## 820  42.204207 0.4149933
## 821  52.966847 0.2493603
## 822  49.449266 0.3067322
## 823  40.686503 0.4314699
## 824  26.746566 0.6302117
## 825  33.166662 0.5316276
## 826  30.616184 0.5747311
## 827  35.161318 0.5072940
## 828  45.644422 0.3614246
## 829  37.614125 0.4712120
## 830  26.408024 0.6360444
## 831  16.641112 0.7744872
## 832  34.426620 0.5236156
## 833  31.007574 0.5768185
## 834  46.307108 0.3384145
## 835  29.111110 0.6003546
## 836  52.556314 0.2624879
## 837  38.744655 0.4574707
## 838  40.878487 0.4312522
## 839  42.258485 0.4071535
## 840  40.610038 0.4332207
## 841  38.957963 0.4515249
## 842  41.175587 0.4235679
## 843  16.409434 0.7814942
## 844  42.744112 0.3991657
## 845  25.823906 0.6410981
## 846  36.410601 0.4923834
## 847  19.778900 0.7356202
## 848  40.601850 0.4338018
## 849  37.514614 0.4771177
## 850  30.489247 0.5739221
## 851  46.707106 0.3402801
## 852  31.235021 0.5630002
## 853  27.231756 0.6184112
## 854  27.249092 0.6170669
## 855  30.457233 0.5803017
## 856  43.391078 0.3903895
## 857  33.248592 0.5356131
## 858  50.209683 0.2894555
## 859  55.628774 0.2147826
## 860  38.923765 0.4596296
## 861  35.368590 0.5134414
## 862  45.348529 0.3588737
## 863  23.441190 0.6741824
## 864  30.061574 0.5826753
## 865  28.096172 0.6145742
## 866  43.279705 0.3944298
## 867  33.597102 0.5332886
## 868  37.996361 0.4608771
## 869  31.751661 0.5544182
## 870  38.158680 0.4686046
## 871  52.543202 0.2668262
## 872  31.094291 0.5731196
## 873  36.920210 0.4788621
## 874  42.877360 0.3903816
## 875  34.791163 0.5134179
## 876  32.246842 0.5548603
## 877  33.686370 0.5315748
## 878  34.285801 0.5226861
## 879  20.298241 0.7339007
## 880  45.551600 0.3660524
## 881  29.029073 0.5951690
## 882  29.960518 0.5887226
## 883  39.287514 0.4529075
## 884  43.093548 0.4060994
## 885  32.184076 0.5508140
## 886  52.456703 0.2591710
## 887  39.242899 0.4483977
## 888  39.412101 0.4477090
## 889  29.429853 0.5897865
## 890  49.293349 0.3039222
## 891  31.460745 0.5563730
## 892  30.782111 0.5775839
## 893  37.642789 0.4701762
## 894  37.860882 0.4712522
## 895  34.917650 0.5163155
## 896  44.402234 0.3796206
## 897  38.334149 0.4773405
## 898  27.383721 0.6221813
## 899  45.241341 0.3636643
## 900  31.615858 0.5547873
## 901  27.451575 0.6141987
## 902  23.913787 0.6670263
## 903  40.335903 0.4334473
## 904  45.567896 0.3547771
## 905  29.204239 0.6003031
## 906  34.583275 0.5210268
## 907  26.518184 0.6349341
## 908  42.546353 0.4106298
## 909  12.675765 0.8386700
## 910  33.371487 0.5305238
## 911  43.669706 0.3959404
## 912  49.391179 0.3036947
## 913  33.437734 0.5429480
## 914  44.733237 0.3696720
## 915  48.369720 0.3238886
## 916  34.733499 0.5195640
## 917  36.474358 0.4988843
## 918  42.766116 0.4030942
## 919  34.906498 0.5177179
## 920  17.806716 0.7636134
## 921  33.441432 0.5367922
## 922  19.600271 0.7371223
## 923  24.492571 0.6688931
## 924  36.532573 0.4966343
## 925  31.920209 0.5565127
## 926  37.572266 0.4789719
## 927  45.342888 0.3662055
## 928  48.219620 0.3189892
## 929  32.507767 0.5513250
## 930  48.460746 0.3174256
## 931  37.971718 0.4660568
## 932  48.662913 0.3159935
## 933  38.465358 0.4574914
## 934  47.483715 0.3338123
## 935  31.444658 0.5658434
## 936  38.093110 0.4648152
## 937  40.288980 0.4347673
## 938  49.562743 0.3024326
## 939  41.335438 0.4226862
## 940  20.045818 0.7333204
## 941  34.510062 0.5192500
## 942  36.891700 0.4889460
## 943  11.207720 0.8630764
## 944  32.158114 0.5409103
## 945  44.715103 0.3780888
## 946  21.103776 0.7185816
## 947  45.844653 0.3583079
## 948  30.267205 0.5809482
## 949  25.172784 0.6530312
## 950  27.979158 0.6200325
## 951  25.740584 0.6493779
## 952  57.537302 0.1879711
## 953  43.972764 0.3895099
## 954  39.928298 0.4426903
## 955  33.106229 0.5460679
## 956  28.947925 0.5967626
## 957  30.276251 0.5785571
## 958  32.782427 0.5444891
## 959  37.189792 0.4807883
## 960  25.354011 0.6464440
## 961  23.079096 0.6792942
## 962  46.098156 0.3529729
## 963  29.729840 0.5879936
## 964  42.031700 0.4108365
## 965  43.473158 0.3927300
## 966  17.826837 0.7673638
## 967  32.798854 0.5439647
## 968  48.858982 0.3030275
## 969  32.273032 0.5486245
## 970  47.771994 0.3300475
## 971  41.663815 0.4203051
## 972  30.377744 0.5852970
## 973  31.328046 0.5626058
## 974  29.678305 0.5922649
## 975  46.959695 0.3467064
## 976  34.834085 0.5218610
## 977  25.384821 0.6460894
## 978  27.305842 0.6142386
## 979  43.759266 0.3815911
## 980  49.094954 0.3163621
## 981  38.833912 0.4519983
## 982  28.390490 0.6168319
## 983  12.623383 0.8305889
## 984  29.979094 0.5862753
## 985  38.112142 0.4671512
## 986  46.420327 0.3465867
## 987  37.749459 0.4666668
## 988  42.338694 0.4064248
## 989  42.855193 0.4009612
## 990  31.328653 0.5649896
## 991  46.015020 0.3604746
## 992  25.517787 0.6563255
## 993  41.299352 0.4205835
## 994  32.714784 0.5422310
## 995  29.446104 0.5988365
## 996  37.524756 0.4765147
## 997  44.150577 0.3820690
## 998  26.651991 0.6404440
## 999  55.167129 0.2151089
## 1000 37.117620 0.4869950
library(dplyr)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
p1 <- lse %>% ggplot(aes(beta_0)) + geom_histogram(binwidth = 5, color = "black")
p2 <- lse %>% ggplot(aes(beta_1)) + geom_histogram(binwidth = 0.1, color = "black")

grid.arrange(p1, p2, ncol = 2)

sample_n(galton_heights, N, replace = TRUE) %>%
  lm(son ~ father, data = .) %>% summary
## 
## Call:
## lm(formula = son ~ father, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5295 -1.4107  0.4784  1.4729  6.4768 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  24.3381     8.6866   2.802  0.00731 ** 
## father        0.6646     0.1252   5.309 2.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.39 on 48 degrees of freedom
## Multiple R-squared:   0.37,  Adjusted R-squared:  0.3569 
## F-statistic: 28.19 on 1 and 48 DF,  p-value: 2.789e-06
lse %>% summarise(se_0 = sd(beta_0), se_1 = sd(beta_1))
##       se_0      se_1
## 1 9.312195 0.1345212

You can see that the standard error estimates reported by the summary function are close to the standard errors we obtained from our Monte Carlo simulation.
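This comparison can be sketched end to end on simulated father/son heights (the intercept, slope, and noise parameters below are hypothetical, chosen only to resemble the Galton data):

```r
# Compare lm()'s reported standard errors with Monte Carlo standard errors.
set.seed(1)
N <- 50
B <- 1000

# Simulate one father/son sample (assumed, illustrative parameters).
sim_pair <- function(n) {
  father <- rnorm(n, mean = 69, sd = 2.5)
  son <- 37.3 + 0.46 * father + rnorm(n, sd = 2.2)
  data.frame(father, son)
}

# Monte Carlo: refit the regression on B independent samples.
lse <- replicate(B, coef(lm(son ~ father, data = sim_pair(N))))

# Monte Carlo standard errors of the two estimates:
mc_se <- apply(lse, 1, sd)

# Standard errors reported by summary() for a single sample:
fit <- lm(son ~ father, data = sim_pair(N))
summary(fit)$coefficients[, "Std. Error"]
mc_se
```

The two sets of standard errors agree closely, which is the point the course output above makes with the bootstrap samples of galton_heights.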


Hypothesis testing for regression models is very commonly used in many areas, for example to report that the effects of a and b were statistically significant after adjusting for x, y, and z.
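Those significance statements come straight from the coefficient table: each coefficient's t value is its estimate divided by its standard error, and the reported p-value tests the null hypothesis that the coefficient is 0. A minimal sketch on simulated data (the variables and parameters here are made up for illustration):

```r
# t values and p-values for regression coefficients on simulated data.
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
coefs <- summary(fit)$coefficients

# The t value column is Estimate / Std. Error:
coefs[, "t value"]
coefs[, "Estimate"] / coefs[, "Std. Error"]

# Pr(>|t|) is the two-sided p-value under H0: coefficient == 0.
coefs[, "Pr(>|t|)"]
```

With a true slope of 0.5 and n = 100, the slope's p-value is far below conventional significance thresholds, so we would reject the null of no effect.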

Advanced Note on LSE

Although interpretation of them is not straightforward, it is also useful to know that the LSE can be strongly correlated, which can be seen using this code:

lse %>% summarize(cor(beta_0, beta_1))

##   cor(beta_0, beta_1)
## 1           -0.999419

However, the correlation depends on how the predictors are defined or transformed. Here we center the father heights, which changes x_i to x_i - x̄:

B <- 1000
N <- 50
lse <- replicate(B, {
      sample_n(galton_heights, N, replace = TRUE) %>%
      mutate(father = father - mean(father)) %>%
      lm(son ~ father, data = .) %>% .$coef 
})

Observe what happens to the correlation in this case:

cor(lse[1,], lse[2,])
## [1] -0.1459641
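The decorrelation achieved by centering can be checked on fully simulated data as well: once the predictor is centered within each sample, the intercept estimates the mean of y and becomes nearly uncorrelated with the slope. A sketch (the height-like parameters are assumptions, not the course data):

```r
# Effect of centering the predictor on the correlation of the LSE.
set.seed(1)
B <- 1000
N <- 50

# Simulate B regressions, optionally centering x within each sample.
sim_lse <- function(center) {
  replicate(B, {
    x <- rnorm(N, mean = 69, sd = 2.5)          # father-like heights (assumed)
    y <- 37.3 + 0.46 * x + rnorm(N, sd = 2.2)   # son-like heights (assumed)
    if (center) x <- x - mean(x)
    coef(lm(y ~ x))
  })
}

lse_raw <- sim_lse(center = FALSE)
lse_ctr <- sim_lse(center = TRUE)

cor_raw <- cor(lse_raw[1, ], lse_raw[2, ])  # strongly negative
cor_ctr <- cor(lse_ctr[1, ], lse_ctr[2, ])  # near zero
```

Because x̄ is far from 0 in the raw data, any overestimate of the slope must be compensated by an underestimate of the intercept, driving the raw correlation close to -1; centering removes that trade-off.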